We develop an approach for feature elimination in empirical risk
minimization and support vector machines, based on recursive elimination
of features. We present theoretical properties of this method
and show that it is uniformly consistent in finding the correct feature
space under certain generalized assumptions. We present case
studies to show that the assumptions are met in most practical situations
and also present simulation studies to demonstrate the performance
of the proposed approach.

1. Introduction. In recent years it has become increasingly easy to collect large amounts of
information, especially with respect to the number of explanatory variables or 'features'. However,
the additional information provided by each of these features may not be significant for explaining
the phenomenon at hand. Learning the functional connection between the explanatory variables
and the response from such high-dimensional data can itself be quite challenging. Moreover, some
of these explanatory variables or features may contain redundant or noisy information, and this
may hamper the quality of learning. One way to overcome this problem is to use variable selection
(also referred to as feature elimination) techniques to find a smaller set of variables that is able to
perform the learning task sufficiently well.



In this work we discuss feature elimination in empirical risk minimization and support vector 
machines, focusing mainly on the latter. The popularity of support vector machines (SVM) as a set 
of supervised learning algorithms is motivated by the fact that SVM learning methods are
easy-to-compute techniques that enable estimation under weak or no assumptions on the distribution (see
Steinwart and Christmann, 2008). SVM learning methods, which we review in detail in Section 2,
are a collection of algorithms that attempt to minimize a regularized version of the empirical 
risk over some reproducing kernel Hilbert space (RKHS) with respect to some loss function. The 
standard SVM decision function typically utilizes all the input variables. Hence, when the input 
dimension is large, it can suffer from the so-called 'Curse of Dimensionality' (Hastie et al., 2001). 
A procedure for variable selection is thus of importance to obtain a more intelligible solution with 
improved efficiency. The advantages of variable selection are manifold: it improves the generalization
performance of the learner, it clarifies the causal relationship in the input-output space, and it
reduces the cost of data collection and storage while improving computational properties.

One of the earliest works on variable selection in SVM was formulated by Guyon et al. (2002). 
Guyon et al. developed a backward elimination procedure based on recursive computation of the 
SVM learning function, known widely as recursive feature elimination (RFE). The RFE algorithm 
performs a recursive ranking of a given set of features. At each recursive step of the algorithm, it 
calculates the change in the RKHS norm of the estimated SVM function after deletion of each of 
the features remaining in the model, and removes the one with the lowest change in such norm. 
The process thus performs an implicit ranking of the features and can even be generalized to 
remove chunks of features at each step of recursion. A number of modified approaches have been 
developed since then, inspired by RFE (see Rakotomamonjy, 2003; Aksu et al., 2010; Aksu, 2012). 
Although there is a rich literature on RFE for SVMs, its theoretical properties have
never been studied. The arguments for RFE have mostly been heuristic, resting on its ability to
perform well in simulated and real-life settings. A key reason behind this lack
of theory is the absence of a well-established framework for building, justifying, and collating the 
theoretical foundation of such a feature elimination method. This paper aims at building such a 
framework and validating RFE as a theoretically sound procedure for feature elimination in SVMs. 

Developing a theoretical structure for RFE is challenging. At each stage of the feature elimination 
process, we move down to a 'lower dimensional' feature space and the functional spaces need to be 
adjusted to cater to the appropriate version of the problem in these subspaces. Euclidean spaces, for 
example, as well as many specialized functional classes admit a nested structure in this regard, but 
as we will see later, this is not true in general. As mentioned before, SVM attempts to minimize 
the empirical regularized risk within an RKHS of functions. Starting with a given RKHS, one
daunting task is re-defining the functional space so that it retains the premises of the original space 
(i.e. admits the reproducing structure) and that these spaces remain cognate to one another. The 
basis for the theory on RFE depends heavily on correctly specifying these pseudo-subspaces, and 
a contribution of this paper is to formulate a way to do this. 

Another contribution of this paper is a modification of the criterion for deletion and ranking of 
features in Guyon et al.'s RFE to enable theoretical consistency. Here we develop a ranking of the 
features based on the lowest difference observed in the regularized empirical risk after removing 
each feature from the existing model. The definition of RFE used here can thus be generalized to the 
much broader yet simpler setting of empirical risk minimization where we can apply the same idea 
to the empirical risk. This can thus serve as a useful starting point for more in-depth theoretical 
analysis of feature elimination in SVM. While Guyon et al.'s RFE tends to rely on the penalization 
criterion in the SVM objective function for ranking features, our approach is risk-based, in that 
we utilize the entire objective function for ranking. The heuristic reasoning behind this is that if 
any of the features do not contribute to the model at all, the increase in the regularized risk will 
be inconsequential. 

In this paper, we show that the modified RFE is asymptotically consistent in finding the 'correct' 
feature space both for SVMs and empirical risk minimization (ERM) under reasonable regularity 
conditions. Although these regularity conditions are true for most of the relevant problems at hand, 
we show through appropriate examples that consistency results for RFE might fail in general, and 
for correct utilization of RFE as a consistent tool for feature elimination in SVMs, we need these 
regularity conditions to hold. The notion of consistency in such a context has not been defined 
previously. This paper also aims at positing a basis for which such results are meaningful. A 
comprehensive statistical analysis of SVMs can be found in Steinwart and Christmann (2008)
(hereafter abbreviated SC08) which is used in this paper to develop the concept of consistency 
for RFE in the context of feature elimination in SVM and ERM. We give an in-depth analysis 
of a few case studies, including the setting of risk minimization in linear models and SVM for 
classification with a Gaussian RBF kernel, to show how the results developed here can be applied 
to specific examples. We also provide some simulation results to validate our theoretical conclusions 
and discuss how to utilize the proposed deletion criteria to select the important features in a given 
setting.

While RFE is a popular and simple method for variable selection, several other methods do exist 
in the context of feature elimination in SVMs. RFE is a classic example of a wrapper that uses 
the learning method itself to score feature subsets. Alternative wrapper-based selection methods 
have also been formulated for feature elimination in SVMs (Weston et al., 2001; Chapelle et al., 
2002). Other basic types of variable selection techniques include filters that select subsets of the 
feature space as a pre-processing step or embedded methods that construct the learning algorithm 
in a way to include feature elimination as an in-built phenomenon. Filters have been used for 
feature elimination in SVMs in many previous works (see for example Mladenic et al., 2004; Peng 
et al., 2005). Embedded variable selection methods include redefining the SVM training to include
sparsity (Weston et al., 2003; Chan et al., 2007). For example, Bradley and Mangasarian (1998)
suggested the use of the $\ell_1$ penalty to encourage feature sparsity. Zhu et al. (2003) suggested an
algorithm to compute the solution path for this $\ell_1$-norm SVM efficiently. Other methods include
introducing different penalty functions like the SCAD penalty (Zhang et al., 2006), the $\ell_q$ penalty
(Liu et al., 2007), a combination of the $\ell_0$ and $\ell_1$ penalties (Liu and Wu, 2007), the elastic net (Wang
et al., 2006), the $\ell_\infty$ norm (Zou and Yuan, 2006), and using a penalty functional in the framework
of the smoothing spline ANOVA (Zhang, 2006). Although these alternative methods appear to 
perform well in practice, RFE still remains the most widely used methodology for feature selection 
in support vector machines due to its simplicity and generality. 

In Section 2, we give a short preliminary background for empirical risk minimization and support 
vector machines. In Section 3 we present the proposed version of RFE for ERM and SVM. In 
Section 4 we discuss the concept of feature elimination in these frameworks. In Section 5 we give 
the necessary assumptions for RFE in empirical risk minimization and support vector machines 
and provide a short discussion on the meaningfulness of these assumptions in varied situations. 
The associated consistency results for RFE are given in Section 6. In Section 7 we discuss our 
results in some known settings of ERM and SVM. In Section 8 we provide some simulation results 
to demonstrate how RFE works and how it can be used in intelligent selection of features. A 
discussion is provided in Section 9, detailed proofs are given in the Appendix, and the resources 
for the necessary codes are given in Supplement A. 

2. Preliminaries. We start off with some preliminaries and define the notation that we will
follow for the rest of the paper. We also give a brief introduction to support vector machines and 
empirical risk minimization. 

2.1. Empirical risk minimization. Empirical risk minimization (ERM) is a general setting of 
many supervised learning problems. 

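Let the input space $(\mathcal{X}, \mathcal{A})$ be measurable, such that $\mathcal{X} \subseteq B \subset \mathbb{R}^d$, where $B$ is an open Euclidean
ball centered at 0, and $\mathcal{Y} \subset \mathbb{R}$, and let $P$ be a measure on $\mathcal{X} \times \mathcal{Y}$. A function
$L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \mapsto [0, \infty)$ is called a loss function if it is measurable. We say that a loss function is
convex if $L(x, y, \cdot)$ is convex for every $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. A loss function is called locally Lipschitz
continuous with local Lipschitz constant $c_L(\cdot)$ if for every $a > 0$,

$$\sup_{x \in \mathcal{X},\, y \in \mathcal{Y}} |L(x, y, s) - L(x, y, s')| \le c_L(a)\, |s - s'|, \qquad s, s' \in [-a, a].$$

The loss function $L$ is said to be Lipschitz continuous if there is a constant $c_L$ such that $c_L(a) \le c_L$
$\forall a \in \mathbb{R}$.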

For any measurable function $f : \mathcal{X} \mapsto \mathbb{R}$ we define the $L$-risk of $f$ with respect to the measure $P$
as $\mathcal{R}_{L,P}(f) = E_P[L(X, Y, f(X))]$. The Bayes risk $\mathcal{R}^*_{L,P}$ with respect to the loss function $L$ and the
measure $P$ is defined as $\inf_f \mathcal{R}_{L,P}(f)$, where the infimum is taken over the set of all measurable
functions, $\mathcal{L}_0(\mathcal{X}) = \{f : \mathcal{X} \mapsto \mathbb{R},\ f \text{ is measurable}\}$. A function $f^*_P$ that achieves this infimum is
called a Bayes decision function.

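Let $\mathcal{F} \subseteq \mathcal{L}_0(\mathcal{X})$ be a non-empty functional space, and $L$ be any loss function. Let

$$f_{P,\mathcal{F}} = \arg\min_{f \in \mathcal{F}} E_P[L(X, Y, f(X))] = \arg\min_{f \in \mathcal{F}} \mathcal{R}_{L,P}(f) \tag{1}$$

be the minimizer of the infinite-sample risk within the space $\mathcal{F}$. Define the minimal risk within the
space $\mathcal{F}$ as $\mathcal{R}^*_{L,P,\mathcal{F}} = \mathcal{R}_{L,P}(f_{P,\mathcal{F}})$. Define the empirical risk $\mathcal{R}_{L,D}$ as

$$\mathcal{R}_{L,D}(f) = \mathbb{P}_n L(X, Y, f(X)) = \frac{1}{n} \sum_{i=1}^{n} L(X_i, Y_i, f(X_i)).$$

A learning method whose decision function $f_{D,\mathcal{F}}$ satisfies $\mathcal{R}_{L,D}(f_{D,\mathcal{F}}) = \inf_{f \in \mathcal{F}} \mathcal{R}_{L,D}(f)$ for all
$n \ge 1$ and $D = \{(X_1, Y_1), \ldots, (X_n, Y_n)\} \in (\mathcal{X} \times \mathcal{Y})^n$ is called empirical risk minimization (ERM)
with respect to $L$ and $\mathcal{F}$.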
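
As an illustration of the definitions above, the following minimal sketch computes the empirical risk and an empirical risk minimizer. The squared loss, the linear function class, the synthetic data, and all function names are assumptions made only for this example; they are not part of the paper.

```python
import numpy as np

def squared_loss(x, y, t):
    """Illustrative loss L(x, y, t) = (y - t)^2 (convex, locally Lipschitz in t)."""
    return (y - t) ** 2

def empirical_risk(loss, f, X, Y):
    """R_{L,D}(f) = (1/n) * sum_i L(X_i, Y_i, f(X_i))."""
    return float(np.mean([loss(x, y, f(x)) for x, y in zip(X, Y)]))

def erm_linear(X, Y):
    """ERM over the assumed class F = {x -> <w, x>} for the squared loss (least squares)."""
    w, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return w

# Synthetic data: only the first two coordinates carry signal.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
Y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(200)

w_hat = erm_linear(X, Y)
risk_hat = empirical_risk(squared_loss, lambda x: x @ w_hat, X, Y)
risk_zero = empirical_risk(squared_loss, lambda x: 0.0, X, Y)
```

By construction, `risk_hat` cannot exceed the empirical risk of any other linear function, including the zero function used for `risk_zero`.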

2.2. Support vector machines. The results developed for SVM in this paper are valid not only
for classification, but also for regression under some general assumptions on the output space $\mathcal{Y}$;
throughout this paper we refer to all these versions as SVM.

Let $H$ be an $\mathbb{R}$-Hilbert function space over $\mathcal{X}$. A function $k : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ is called a
reproducing kernel of $H$ if $k(\cdot, x) \in H$ for all $x \in \mathcal{X}$ and it has the reproducing property that
$f(x) = \langle f, k(\cdot, x) \rangle$ for all $f \in H$ and all $x \in \mathcal{X}$. The space $H$ is called a real-valued reproducing
kernel Hilbert space (RKHS) over $\mathcal{X}$ if for all $x \in \mathcal{X}$, the Dirac functional $\delta_x : H \mapsto \mathbb{R}$ defined by
$\delta_x(f) := f(x)$, $f \in H$, is continuous (for more details refer to Chapter 4 of SC08).

Let $L$ be a convex, locally Lipschitz continuous loss function and $H$ be a separable RKHS of a
measurable kernel $k$ on $\mathcal{X}$. Let $D = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be a set of $n$ i.i.d. observations drawn
according to the probability measure $P$ and fix a $\lambda > 0$. Define the empirical SVM decision function
as

$$f_{D,\lambda,H} = \arg\min_{f \in H} \lambda \|f\|_H^2 + \mathcal{R}_{L,D}(f), \tag{2}$$

where $\mathcal{R}_{L,D}(f)$ is the empirical risk defined as before.
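
To make the regularized objective concrete, here is a minimal sketch for one special case, assuming the squared loss and a Gaussian RBF kernel (both choices, and all function names, are illustrative assumptions and not prescribed by the paper). For this loss the minimizer has the form $f = \sum_i \alpha_i k(\cdot, X_i)$ with $\alpha = (K + n\lambda I)^{-1} y$, where $K$ is the Gram matrix, and $\|f\|_H^2 = \alpha^\top K \alpha$.

```python
import numpy as np

def gaussian_gram(X, gamma=1.0):
    """Gram matrix of the Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def fit_least_squares_svm(K, y, lam):
    """Coefficients alpha of f = sum_i alpha_i k(., X_i) minimizing
    lam * ||f||_H^2 + (1/n) * sum_i (y_i - f(X_i))^2 (squared loss assumed)."""
    n = len(y)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def regularized_empirical_risk(K, y, alpha, lam):
    """lam * ||f||_H^2 + R_{L,D}(f) for f = sum_i alpha_i k(., X_i)."""
    fitted = K @ alpha
    return float(lam * (alpha @ K @ alpha) + np.mean((y - fitted) ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 2))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(50)

K = gaussian_gram(X)
alpha = fit_least_squares_svm(K, y, lam=0.1)
value = regularized_empirical_risk(K, y, alpha, lam=0.1)
```

Other convex losses (for example the hinge loss) have no such closed form and would require a numerical solver; the closed form is used here only to keep the sketch short.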

For a given $\lambda$, define the SVM learning method $\mathfrak{L}$ as the map $(\mathcal{X} \times \mathcal{Y})^n \times \mathcal{X} \mapsto \mathbb{R}$ defined by
$(D, x) \mapsto f_{D,\lambda,H}(x)$, for all $n \ge 1$. We say that a learning method $\mathfrak{L}$ is measurable if it is measurable
for all $n$ with respect to the minimal completion of the product $\sigma$-field on $(\mathcal{X} \times \mathcal{Y})^n \times \mathcal{X}$. Lemma
6.23 of SC08, under the conditions given in Section 2.2 above, yields that the corresponding SVM
that produces the decision functions $f_{D,\lambda,H}$ for $D \in (\mathcal{X} \times \mathcal{Y})^n$ is a measurable learning method for
all $\lambda > 0$ and for all $n \ge 1$. The maps $D \mapsto f_{D,\lambda,H}$ mapping $(\mathcal{X} \times \mathcal{Y})^n$ to $H$ are measurable. Since
Lemma 2.11 of SC08 shows that the map $H \times \mathcal{X} \mapsto \mathbb{R}$ defined by $(f, x) \mapsto f(x)$ is measurable, we
therefore obtain measurability of $(D, x) \mapsto f_{D,\lambda,H}(x)$.

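Define $f_{P,\lambda,H} = \arg\min_{f \in H} \lambda \|f\|_H^2 + \mathcal{R}_{L,P}(f)$ and define the approximation error

$$A_2^H(\lambda) = \lambda \|f_{P,\lambda,H}\|_H^2 + \mathcal{R}_{L,P}(f_{P,\lambda,H}) - \inf_{f \in H} \mathcal{R}_{L,P}(f). \tag{3}$$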

2.3. Entropy Numbers. For $(T, d)$ a metric space and for any integer $n \ge 1$, the $n$-th entropy
number of $(T, d)$ is defined as

$$e_n(T, d) := \inf\Big\{ \varepsilon > 0 : \exists\, s_1, \ldots, s_{2^{n-1}} \in T \text{ such that } T \subset \bigcup_{i=1}^{2^{n-1}} B_d(s_i, \varepsilon) \Big\}, \tag{4}$$

where $B_d(s, \varepsilon)$ is the ball of radius $\varepsilon$ centered at $s$, with respect to the metric $d$. Let $S : E \mapsto F$ be a
bounded linear operator between normed spaces $E$ and $F$; then we write $e_n(S) = e_n(S B_E, \|\cdot\|_F)$,
where $B_E$ is the unit ball in $E$.

Note: If we have $\|k\|_\infty < \infty$ for a given kernel $k$, Lemma 4.23 of SC08 implies that every
$f \in H(k)$ is bounded, which further implies that $H \subset \mathcal{L}_\infty(\mathcal{X})$, where
$\mathcal{L}_\infty(\mathcal{X}) := \{f : \mathcal{X} \mapsto \mathbb{R},\ \|f\|_\infty < \infty\}$.



3. Feature Elimination Algorithm. The original recursive feature elimination (RFE)
method was proposed for SVMs by Guyon et al. (2002). The version of the feature elimination procedure
we propose here is similar to the one in Guyon et al. except for the elimination criterion. While
Guyon et al. use the Hilbert space norm $\lambda \|f\|_H^2$ as the criterion to eliminate features recursively, we use
the entire objective function, which includes the regularized Hilbert space norm along with the empirical
risk. Hence while Guyon's RFE is only applicable in analyses involving SVM, the modified RFE
that we propose here can be used in ERM as well.

3.1. The Algorithms. The RFE was originally developed for support vector machines, hence we 
provide the algorithm for SVM first. The definition for ERM follows similarly. First we define some 
related concepts. A detailed discussion on these will be given in Section 4. 

Definition 1. For any set of indices $J \subseteq \{1, 2, \ldots, d\}$ and a given functional space $\mathcal{F}$, define
$\mathcal{F}^J = \{g : g = f \circ \pi^{J^c},\ \forall f \in \mathcal{F}\}$, where $\pi^{J^c}$ is the projection map taking $x$ to $\tilde{x}$ ($x, \tilde{x} \in \mathbb{R}^d$),
such that $\tilde{x}$ is produced from $x$ by replacing the elements of $x$ indexed in the set $J$ by zero.

We define the space $\mathcal{X}^J = \{\pi^{J^c}(x) : x \in \mathcal{X}\}$, such that $\pi^{J^c} : \mathcal{X} \mapsto \mathcal{X}^J$ is a surjection.

Definition 2. For a given RKHS $H$ indexed by a kernel $k$ and for a given $J$, define
$H^J = H_{k \circ \pi^{J^c}}(\mathcal{X})$, where $k \circ \pi^{J^c}(x, y) := k(\pi^{J^c}(x), \pi^{J^c}(y))$.
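
The two definitions can be illustrated with a short sketch of the projection map and of the projected kernel. The function names, the Gaussian kernel, and the 0-based indexing of $J$ are assumptions of this example (the paper indexes features from 1).

```python
import numpy as np

def project(x, J):
    """pi^{J^c}: copy x and set the coordinates indexed by J (0-based here) to zero."""
    x_tilde = np.array(x, dtype=float)
    x_tilde[..., np.array(sorted(J), dtype=int)] = 0.0
    return x_tilde

def projected_kernel(k, J):
    """k o pi^{J^c}: evaluate the kernel k on the projected inputs."""
    return lambda x, y: k(project(x, J), project(y, J))

def gaussian_kernel(x, y, gamma=1.0):
    """Illustrative kernel choice: k(x, y) = exp(-gamma * ||x - y||^2)."""
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

J = {1}                                    # delete the second feature
k_J = projected_kernel(gaussian_kernel, J)

x = np.array([0.3, -0.7, 0.2])
y = np.array([0.1, 0.4, -0.5])
x_changed = np.array([0.3, 9.9, 0.2])      # differs from x only in the deleted coordinate
```

Because the projected kernel only sees $\pi^{J^c}(x)$, `k_J(x, y)` and `k_J(x_changed, y)` coincide: functions in $H^J$ cannot depend on the deleted coordinates.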

Now we are ready to provide the algorithm. Assume the support vector machine framework, 
where we are given an RKHS H with respect to a kernel k. 

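Algorithm 3. Start off with $J \equiv [\cdot]$ empty and let $Z \equiv [1, 2, \ldots, d]$.
STEP 1: In the $k$-th cycle of the algorithm choose the dimension $i_k \in Z \setminus J$ for which

$$i_k = \arg\min_{i \in Z \setminus J} \ \lambda \big\|f_{D,\lambda,H^{J \cup \{i\}}}\big\|^2_{H^{J \cup \{i\}}} + \mathcal{R}_{L,D}\big(f_{D,\lambda,H^{J \cup \{i\}}}\big) - \lambda \big\|f_{D,\lambda,H^J}\big\|^2_{H^J} - \mathcal{R}_{L,D}\big(f_{D,\lambda,H^J}\big). \tag{5}$$

STEP 2: Update $J = J \cup \{i_k\}$. Go to STEP 1.

Continue this until the difference
$\min_{i \in Z \setminus J} \lambda \|f_{D,\lambda,H^{J \cup \{i\}}}\|^2_{H^{J \cup \{i\}}} + \mathcal{R}_{L,D}(f_{D,\lambda,H^{J \cup \{i\}}}) - \lambda \|f_{D,\lambda,H^J}\|^2_{H^J} - \mathcal{R}_{L,D}(f_{D,\lambda,H^J})$ becomes larger
than a pre-determined quantity $\delta_n$.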

Now for an empirical risk minimization framework with respect to a given functional space $\mathcal{F}$,
Algorithm 3 can be modified to match the setting of ERM.

Algorithm 4. Replace the regularized empirical risk $\lambda \|f_{D,\lambda,H^J}\|^2_{H^J} + \mathcal{R}_{L,D}(f_{D,\lambda,H^J})$ in
Algorithm 3 (defined for a given set of indices $J$) by the empirical risk $\mathcal{R}_{L,D}(f_{D,\mathcal{F}^J})$.
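
The recursion in Algorithms 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the loop takes any function `objective(J)` returning the minimized objective for the index set `J`, and the instance shown is the ERM case with squared loss over linear functions without intercept, for which $\mathcal{F}^J$ consists of linear functions of the retained coordinates. The names `risk_rfe` and `erm_objective`, the synthetic data, the threshold value, and the 0-based indices are all hypothetical.

```python
import numpy as np

def erm_objective(X, y):
    """Return J -> minimal empirical squared-loss risk over linear functions of pi^{J^c}(x)."""
    d = X.shape[1]
    def value(J):
        keep = [i for i in range(d) if i not in J]
        if not keep:                       # every feature deleted: only f = 0 remains
            return float(np.mean(y ** 2))
        w, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
        return float(np.mean((y - X[:, keep] @ w) ** 2))
    return value

def risk_rfe(objective, d, delta):
    """Delete one feature per cycle until the smallest increase exceeds delta."""
    J = []
    current = objective(J)
    while len(J) < d:
        candidates = [i for i in range(d) if i not in J]
        values = {i: objective(J + [i]) for i in candidates}
        i_k = min(values, key=values.get)  # STEP 1: smallest increase in the objective
        if values[i_k] - current > delta:  # stopping rule with threshold delta_n
            break
        J.append(i_k)                      # STEP 2: J = J U {i_k}
        current = values[i_k]
    return J

# Synthetic data: only coordinate 0 carries signal, coordinates 1-3 are noise.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 4))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(200)

deleted = risk_rfe(erm_objective(X, y), d=4, delta=0.05)
```

On this synthetic example the noise coordinates raise the empirical risk only negligibly when removed, whereas removing coordinate 0 raises it by roughly the variance of the signal, so the loop is expected to stop with coordinate 0 retained. For the SVM setting of Algorithm 3, `objective(J)` would instead return the minimized regularized empirical risk over $H^J$.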

3.2. Cycle of RFE. We define the 'cycle' of the RFE algorithm as the number of 'dimensions'
deleted in one step of the algorithm. The algorithms in Section 3.1 have cycle = 1. One can, however,
define them for cycles of size greater than 1, in which case one deletes chunks of dimensions at a time,
equal to the size of the cycle. The cycle can also be defined adaptively, so that in different runs of the
algorithm the cycle sizes are different. The theoretical results derived in this paper hold for cycles of any
size. Hence, for the sake of simplicity, we set the cycle size to 1.

4. Functional Spaces on Lower Dimensional Domains. The aim of this section is to 
provide a detailed reasoning behind Definitions 1-2 in Section 3.1. 

4.1. Feature Elimination in ERM. Suppose we have a functional class $\mathcal{F} \subseteq \mathcal{L}_\infty(\mathcal{X})$$^{1}$, where $\mathcal{X}$
is as defined in Section 2, and let our goal be to find a function $f$ within the functional class $\mathcal{F}$
which minimizes an empirical criterion (like empirical risk in ERM). If the dimension $d$ of the
input space is too large, this might lead to more complex solutions when in fact a simpler solution
might be good enough. Now suppose that the minimizer of the appropriate infinite-sample version
of the empirical criterion (like risk or expected loss in case of ERM and SVM) with respect to
the probability measure $P$ on $\mathcal{X} \times \mathcal{Y}$ and the functional class $\mathcal{L}_\infty(\mathcal{X})$ actually resides in $\mathcal{L}_\infty(\mathcal{X}^*)$,
where $\mathcal{X}^*$ is a lower dimensional version of the given input space $\mathcal{X}$. Then to avoid overfitting it
is necessary that we try to find the empirical minimizer in a suitably defined lower dimensional
version of $\mathcal{F}$. We define the lower dimensional adaptations of the original functional space as in
Definition 1.

$^{1}$Note that the loss functions we consider in this paper (unless otherwise mentioned) are convex and locally
Lipschitz with $\mathcal{R}_{L,P}(0) < \infty$, and hence by (2.11) and Proposition 5.27 of SC08, we have $\mathcal{R}^*_{L,P,\mathcal{L}_\infty(\mathcal{X})} = \mathcal{R}^*_{L,P}$. Hence
instead of $\mathcal{L}_0(\mathcal{X})$ it suffices to consider the smaller subspace $\mathcal{L}_\infty(\mathcal{X})$.

First note that $\mathcal{X}^J$ may not be a subspace of $\mathcal{X}$, because for any $x \in \mathcal{X}$, $\pi^{J^c}(x)$ may not be
contained in $\mathcal{X}$. The inclusion does hold trivially, however, for any Euclidean open ball $B$ centered at
0, and from Section 2 we have that $\mathcal{X} \subseteq B$ for some $B \subset \mathbb{R}^d$. Hence we will assume that the
functional space $\mathcal{F}$ can be sufficiently redefined as $\mathcal{F}_B$, where the domain of functions in $\mathcal{F}_B$ is $B$
instead of $\mathcal{X}$, such that $\mathcal{F}_B|_{\mathcal{X}} = \mathcal{F}$. This makes the functional classes $\mathcal{F}^J$ well-defined, and unless
otherwise mentioned, we will assume from here on that $\mathcal{X}^J \subseteq \mathcal{X}$ for all possible $J$.

Note also that $\mathcal{F}^J$ may not be a subspace of $\mathcal{F}$. Although it is more desirable for these functional
classes to admit a nested structure, so that as we go down from a space to its
lower dimensional version (that is, from $\mathcal{F}^{J_1}$ to $\mathcal{F}^{J_2}$ where $J_1 \subseteq J_2$) we have the simple relation
$\mathcal{F}^{J_2} \subseteq \mathcal{F}^{J_1}$, this does not hold in general.

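We now provide a few results that connect the space $\mathcal{F}$ with its lower dimensional versions.
Note that our definition trivially implies that $\mathcal{F}^J|_{\mathcal{X}^J} = \{f|_{\mathcal{X}^J} : f \in \mathcal{F}\} = \mathcal{F}|_{\mathcal{X}^J}$. If we define
$\mathcal{L}^J_\infty(\mathcal{X}) = \{f \circ \pi^{J^c} : f \in \mathcal{L}_\infty(\mathcal{X})\}$, then Lemma 5 says that $\mathcal{L}^J_\infty(\mathcal{X}^J) = \mathcal{L}^J_\infty(\mathcal{X})|_{\mathcal{X}^J}$ is isomorphic
to the space $\mathcal{L}_\infty(\mathcal{X}^J)$. Lemma 6 below shows that by defining the functional classes in this way,
many of the nice properties of the functional class $\mathcal{F}$ are carried forward to the $\mathcal{F}^J$s. The proofs
can be found in Appendix A.1 and A.2.

Lemma 5. $\mathcal{L}^J_\infty(\mathcal{X}^J) \cong \mathcal{L}_\infty(\mathcal{X}^J)$.

Lemma 6. Let $\mathcal{F} \subseteq \mathcal{L}_\infty(\mathcal{X})$ be a non-empty functional subspace. Then for any $J \subseteq \{1, 2, \ldots, d\}$,

1. If $\mathcal{F}$ is dense in $\mathcal{L}_\infty(\mathcal{X})$, then $\mathcal{F}^J$ is dense in $\mathcal{L}^J_\infty(\mathcal{X})$.

2. If $\mathcal{F}$ is compact, then so is $\mathcal{F}^J$.

3. $e_i(\mathcal{F}^J, \|\cdot\|_\infty) \le e_i(\mathcal{F}, \|\cdot\|_\infty)$, $\forall i \ge 1$, where $e_i(\mathcal{F}, \|\cdot\|_\infty)$ is the $i$-th entropy number of the set $\mathcal{F}$
with respect to the $\|\cdot\|_\infty$-norm as defined in Section 2.3.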

4.2. Feature Elimination in SVM. In empirical risk minimization problems our primary focus is
the empirical risk, whereas in the case of support vector machines we concentrate on the regularized
version of the empirical risk, $\lambda \|f\|_H^2 + \mathcal{R}_{L,D}(f)$. The minimization is typically computed over an
RKHS $H$, that is, our objective is to find $f_{D,\lambda,H} = \arg\min_{f \in H} \lambda \|f\|_H^2 + \mathcal{R}_{L,D}(f)$. The regularization
term $\lambda \|f\|_H^2$ is used to penalize functions $f$ with a large RKHS norm. Complex functions $f \in H$
which model too closely the output values in the training data set $D$ tend to have large $H$-norms
(refer to Exercise 6.7 in SC08 for a clear motivation). Again we assume that $\mathcal{X} \subseteq B \subset \mathbb{R}^d$, where $B$
is an open Euclidean ball centered at 0. We will also assume that we can sufficiently re-define the
RKHS $H$ as $H_B$, such that the domain of the functions in $H_B$ is the Euclidean open ball $B$ instead
of $\mathcal{X}$. So we can extend the domain of the kernel $k$ of the RKHS $H$ from $\mathcal{X} \times \mathcal{X}$ to $B \times B$, and from
here onwards we assume $\mathcal{X}^J \subseteq \mathcal{X}$. The way we defined the lower dimensional functional
spaces in the previous section may not be sufficient here, mainly because in SVM the minimization
is computed over an RKHS, and the properties of RKHSs dictate much of the statistical analysis.
Hence we need to find a way to define them so that these spaces are RKHSs as well.

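First we review some properties of RKHSs:

(1) The '$H_{\text{pre}}$' space for an RKHS $H$ with kernel $k$ is defined as
$H_{\text{pre}} := \big\{ \sum_{i=1}^{n} \alpha_i k(\cdot, x_i) : n \in \mathbb{N},\ \alpha_1, \ldots, \alpha_n \in \mathbb{R},\ x_1, \ldots, x_n \in \mathcal{X} \big\}$.
$H$ is the completion of the space $H_{\text{pre}}$ (see Theorem 4.16 of SC08 for details).

(2) Let $S$ be any set and $\varphi : S \mapsto \mathcal{X}$ be a map. Let $k : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ be the kernel on $\mathcal{X}$. Then
define the map $k \circ \varphi : S \times S \mapsto \mathbb{R}$ as $k \circ \varphi(s, t) = k(\varphi(s), \varphi(t))$. Observe that $k \circ \varphi$ is a kernel on
$S$ (Paulsen, 2009, Proposition 5.13).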

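The next theorem gives a natural relationship between the RKHS $H(k)$ on $\mathcal{X}$ and the RKHS
$H(k \circ \varphi)$ on $S$. Also, when $S$ is a subset of $\mathcal{X}$ and $\varphi$ is the inclusion map $\mathrm{id}$ of $S$ into $\mathcal{X}$, then the
kernel $k \circ \varphi$ is the restriction of the kernel $k$ to $S \times S$.

Theorem 7. Let $\mathcal{X}$ and $S$ be two sets, let $k : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ be a kernel function on $\mathcal{X}$ and
let $\varphi : S \mapsto \mathcal{X}$ be a function. Then $H(k \circ \varphi) = \{f \circ \varphi : f \in H(k)\}$, and for $g \in H(k \circ \varphi)$ we have
that $\|g\|_{H(k \circ \varphi)} = \inf\{\|f\|_{H(k)} : g = f \circ \varphi\}$.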

See Paulsen (2009) for a proof of Theorem 7. 

Now let $\mathcal{X}_0$ be a subset of $\mathcal{X}$, let $k^{(0)}(x, y)$ be the restriction of the kernel $k$ to $\mathcal{X}_0$, let $H_{k^{(0)}}(\mathcal{X}_0)$
be the RKHS admitting $k^{(0)}(x, y)$ as its reproducing kernel, and let $H_k(\mathcal{X})$ be the RKHS with
reproducing kernel $k(x, y)$. Then by the above theorem, defining $\varphi$ to be the inclusion map $\mathrm{id}$
from $\mathcal{X}_0$ to $\mathcal{X}$, we have $H_{k^{(0)}}(\mathcal{X}_0) = \{f|_{\mathcal{X}_0} : f \in H_k(\mathcal{X})\}$ and $\|g\|_{H_{k^{(0)}}} = \min\{\|f\|_{H_k} : f|_{\mathcal{X}_0} = g\}$
for $g \in H_{k^{(0)}}(\mathcal{X}_0)$.
For a given RKHS $H_k(\mathcal{X})$, we can now define these new functional spaces in the following ways:

(Def 1) Projection of the Functional Space: We can define it as we did in the previous section. So
define $H_1^J$ on $\mathcal{X}$ as $H_1^J = H^J(\mathcal{X}) = \{f \circ \pi^{J^c} : f \in H_k(\mathcal{X})\}$.
(Def 2) Projection of the kernel: $H_2^J$ defined on $\mathcal{X}$ as $H_2^J = H_{k \circ \pi^{J^c}}(\mathcal{X})$. Note that by defining them
like this, the new spaces that we obtain are all RKHSs on $\mathcal{X}$.

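From the discussion below Theorem 7 and from Def 2 we have

$$H_2^J|_{\mathcal{X}^J} = H_{k \circ \pi^{J^c}}(\mathcal{X}^J) = \{f|_{\mathcal{X}^J} : f \in H_k(\mathcal{X})\}. \tag{6}$$

Also note from Def 1 that

$$H_1^J|_{\mathcal{X}^J} = H^J(\mathcal{X}^J) = \{f \circ \pi^{J^c}|_{\mathcal{X}^J} : f \in H_k(\mathcal{X})\} = \{f|_{\mathcal{X}^J} : f \in H_k(\mathcal{X})\}. \tag{7}$$

So we see that the restrictions of both of these functional spaces to $\mathcal{X}^J$ are the same and the restriction
space is itself an RKHS on $\mathcal{X}^J$. Also note trivially that $H_{k \circ \pi^{J^c}}(\mathcal{X}^J) = H_{k \circ \pi^{J^c}}(\mathcal{X})$, and hence from
now onwards we refer to the space $H_{k \circ \pi^{J^c}}(\mathcal{X})$ simply as $H^J$.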

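Next we restate Lemma 6 for RKHSs. The proofs are similar and hence omitted:

Lemma 8. Let $H \subseteq \mathcal{L}_\infty(\mathcal{X})$ be a non-empty RKHS on $\mathcal{X}$. Then for any $J \subseteq \{1, 2, \ldots, d\}$,

1. If $H$ is dense in $\mathcal{L}_\infty(\mathcal{X})$, then $H^J$ is dense in $\mathcal{L}_\infty(\mathcal{X}^J)$.

2. If the $\|\cdot\|_\infty$-closure $\overline{B_H}$ of the unit ball $B_H$ is compact, then so is $\overline{B_{H^J}}$.

3. If $H$ is separable, then so is $H^J$.

4. $e_i(\mathrm{id} : H^J \mapsto L_\infty(\mathcal{X}^J)) \le e_i(\mathrm{id} : H \mapsto L_\infty(\mathcal{X}))$, where $e_i(\mathrm{id} : H \mapsto L_\infty(\mathcal{X}))$ is the $i$-th entropy
number of the unit ball $B_H$ of the RKHS $H$ with respect to the $\|\cdot\|_\infty$-norm.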

In order to provide a heuristic understanding of the importance of the above projection spaces
in feature selection, we give an alternative definition of lower dimensional versions of the input
space. First, define the map $\sigma^J : \mathbb{R}^d \mapsto \mathbb{R}^{|J|}$ such that for $x = \{x_1, \ldots, x_d\} \in \mathbb{R}^d$,
$\sigma^J(x) = \{x_{J_{\min}}, \ldots, x_{J_{\max}}\} \in \mathbb{R}^{|J|}$. So $\sigma^J(x)$ is the $|J|$-dimensional vector containing only those elements
of $x$ which are given in the index set $J$. Hence we can define the deleted input space $\mathcal{X}^{-J}$ as
$\mathcal{X}^{-J} := \{\sigma^{J^c}(x) = \{x_{J^c_{\min}}, \ldots, x_{J^c_{\max}}\} : x \in \mathcal{X}\}$.

Consider the set up of Theorem 7, with $\mathcal{X} = \mathcal{X}^J$ and $S = \mathcal{X}^{-J}$. Consider the restricted
kernel $k^J$ on $\mathcal{X}^J$ with $k^J(x, y) = k(x, y)$ for all $x, y \in \mathcal{X}^J$. Now for any $y \in \mathcal{X}^{-J}$ define the map
$\varphi = \phi^J : \mathcal{X}^{-J} \mapsto \mathcal{X}^J$ as $\phi^J(y) = \pi^{J^c}(x)$ for any $x \in \mathcal{X}$ satisfying $y = \sigma^{J^c}(x)$. In other words,
the map $\phi^J$ takes an element from the deleted space, fills in the gaps with zeros and returns an
element from the projected space. Note then that $\phi^J$ is a bijection, and hence the spaces $\mathcal{X}^J$ and
$\mathcal{X}^{-J}$ are isomorphic to each other.

Hence from Theorem 7, we see that $k^J \circ \phi^J$ is a kernel defined on $\mathcal{X}^{-J}$ with the corresponding
RKHS $H_{k^J \circ \phi^J}$ on $\mathcal{X}^{-J}$. Suppose that instead of $\mathcal{X}$, our input space is $\mathcal{X}^{-J}$. We want to know when
we can define a kernel $k^{-J}$ on $\mathcal{X}^{-J}$ such that it is the natural abridgment of the kernel $k$ on $\mathcal{X}$
(in the sense of being able to define it on deleted vectors), and we want to know if there exists a
natural connection between $H_{k^{-J}}(\mathcal{X}^{-J})$ and $H_{k^J \circ \phi^J}(\mathcal{X}^{-J})$ in those cases.

The motivation for the definition of $k^{-J}$ stems from previous works on feature elimination in
support vector machines. The recursive feature elimination procedure developed in Guyon et al.
(2002) and subsequently revisited and modified in Rakotomamonjy (2003) starts off with a given
input space $\mathcal{X}$ and eliminates features using a weight criterion recursively computed by re-training
the SVM on the lower dimensional spaces $\mathcal{X}^{-J}$. From their discussion, it is seen that if the Gram
matrix of the training vectors $\{x_1, \ldots, x_n\}$ is given by $\{k(x_k, x_j)\}_{k,j=1}^{n}$, then the Gram matrix
of the training vectors $\{x_1^{-i}, \ldots, x_n^{-i}\}$ after deleting a particular variable, say $x_i$, is taken to be
$\{k^{-i}(x_k, x_j)\}_{k,j=1}^{n}$, where $k^{-i}(x_k, x_j) = k(x_k^{-i}, x_j^{-i})$. This implicitly assumes
that the kernel $k$ can be defined on deleted vectors as well, that is, that $k$ is well defined for any pair of
vectors $x$ and $y$ where $x, y \in \mathbb{R}^{d_0}$ and $d_0 < d$. This is clearly not true in general for any kernel $k$ on
$\mathbb{R}^d$. So we prefer to work with the projected spaces $\mathcal{X}^J$ instead of the deleted spaces $\mathcal{X}^{-J}$, as this is
more general. But we will show in the discussions below that in most practical cases as discussed
in Guyon et al. (2002) and Rakotomamonjy (2003), many of the kernels that we work with satisfy
an intrinsic relationship between $k^{-J}$ and $k^J \circ \phi^J$. Hence in those cases it is appropriate to work
with either of the two setups.

4.2.1. Kernels in Statistical Learning. Most popular kernels in statistical learning can be
categorized into three main groups: translation invariant kernels, kernels originating from generative
models (like Jaakkola and Haussler, Watsons) and dot product kernels (see Smola et al., 2000). In
this paper we restrict our attention to only translation invariant and dot product kernels.

• Translation Invariant Kernels: A translation invariant kernel satisfies $k(x, y) = g(x - y)$.
The class of translation invariant kernels also includes radial kernels, which satisfy $k(x, y) =
g(\|x - y\|^2)$, where $\|x\|$ is the usual Euclidean norm of the vector $x$ in its correct dimension.

• Dot-Product Kernels: A dot-product kernel satisfies $k(x, y) = g(\langle x, y \rangle)$, where $\langle x, y \rangle$ is the
standard inner product between vectors $x$ and $y$ in their correct dimension.

Lemma 9. For radial kernels and dot product kernels, $k^{-J} = k^J \circ \phi^J$.

The proof is simple and therefore omitted.

Also note that for kernels defined on weighted norms ($k(x, y) = g(\|x - y\|_W)$ where $\|x - y\|_W :=
(x - y)^\top W (x - y)$, with $W$ being a positive $d \times d$ diagonal matrix), the above condition is also
satisfied.
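
The identity in Lemma 9 can be checked numerically: evaluating a radial or dot-product kernel on deleted vectors gives the same value as evaluating it on the zero-filled (projected) vectors, because zero coordinates contribute nothing to $\|x - y\|^2$ or $\langle x, y \rangle$. The sketch below assumes a Gaussian RBF kernel and a polynomial kernel as representatives; the function names and the 0-based index set are illustrative.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Radial kernel: g(||x - y||^2) with g(t) = exp(-gamma * t)."""
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def poly(x, y, degree=3):
    """Dot-product kernel: g(<x, y>) with g(t) = (1 + t)^degree."""
    return float((1.0 + x @ y) ** degree)

def delete(x, J):
    """sigma^{J^c}: keep only the coordinates outside J (0-based here)."""
    keep = [i for i in range(len(x)) if i not in J]
    return x[keep]

def project(x, J):
    """pi^{J^c}: set the coordinates in J to zero, keeping the dimension."""
    out = x.copy()
    out[sorted(J)] = 0.0
    return out

rng = np.random.default_rng(2)
x, y = rng.standard_normal(5), rng.standard_normal(5)
J = {1, 3}

rbf_deleted = rbf(delete(x, J), delete(y, J))        # k^{-J} on deleted vectors
rbf_projected = rbf(project(x, J), project(y, J))    # k^J o phi^J on the same vectors
poly_deleted = poly(delete(x, J), delete(y, J))
poly_projected = poly(project(x, J), project(y, J))
```

Both pairs of values agree up to floating-point rounding. A kernel whose definition depends on the ambient dimension in some other way need not satisfy this identity, which is why the paper works with projected spaces in general.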

4.2.2. Universal Kernels. A continuous kernel $k$ on a compact metric space $\mathcal{X}$ is called universal
if the RKHS $H_k(\mathcal{X})$ is dense in $C(\mathcal{X})$, i.e., for every function $g \in C(\mathcal{X})$ and all $\varepsilon > 0$ there exists
an $f \in H_k(\mathcal{X})$ such that $\|f - g\|_\infty < \varepsilon$ (where $C(\mathcal{X})$ denotes the set of continuous functions from
$\mathcal{X}$ to $\mathbb{R}$). From Proposition 5.29 of SC08 we see that if $\mathcal{X}$ is a compact metric space with $H_k(\mathcal{X})$
the RKHS of a universal kernel $k$ on $\mathcal{X}$, $P$ a distribution on $\mathcal{X} \times \mathcal{Y}$ and $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \mapsto [0, \infty)$
a convex, locally Lipschitz continuous loss with $\mathcal{R}_{L,P}(0) < \infty$, then we have that $\mathcal{R}^*_{L,P,H} = \mathcal{R}^*_{L,P}$.
Universal kernels produce particularly large RKHSs.

4.2.3. Universality of Kernels. For this we refer our readers to Micchelli et al. (2006), where
the notion of universality for most of the special types of kernels is discussed in detail (including
dot product and radial kernels). However, we state two results on radial kernels here to show that
under quite weak assumptions, all of the non-trivial radial kernels are universal.

(RK1) Representation of Radial Kernels: Schoenberg (1938) showed that a function
$k(x, y) : \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$ defined as

$$k(x, y) := g(\|x - y\|^2), \qquad x, y \in \mathbb{R}^d, \tag{8}$$

where $\|\cdot\|$ is the usual Euclidean norm, is a valid kernel on $\mathbb{R}^d \times \mathbb{R}^d$ for all $d \in \mathbb{N}$ iff there
exists a finite Borel measure $\mu$ on $\mathbb{R}_+$ such that for all $t \in \mathbb{R}_+$,

$$g(t) := \int_{\mathbb{R}_+} e^{-t\sigma}\, d\mu(\sigma). \tag{9}$$

Not all kernels of this type are universal. Indeed, the choice of a measure concentrated only at
$\sigma = 0$ gives a kernel $k$ that is identically constant and is therefore not universal. The next result
shows that this is the only exceptional case.

(RK2) Universality of Radial Kernels: If the measure $\mu$ in (9) is not concentrated at zero, then the
radial kernel $k$ in (8) is universal (for a proof see Micchelli et al. (2006)).

4.3. Notion of risk in Lower Dimensional Versions of the Input Space. Note that the functional
space $\mathcal{F}^J$ (and equivalently the RKHS $H^J$) is defined on the entire input space $\mathcal{X}$ and not only on
$\mathcal{X}^J$. So we can define the risk of a function $f_J \in \mathcal{F}^J$ (or $f_J \in H^J$) on the entire input space $\mathcal{X}$ and
not just on $\mathcal{X}^J$. Hence for a probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, define $\mathcal{R}_{L,P}(f_J)$ as $\mathcal{R}_{L,P}(f_J) =
\int_{\mathcal{Y}} \int_{\mathcal{X}} L(x, y, f_J(x))\, P(x, y)\, dx\, dy$. This means that we can compare the risk of functions in different
lower dimensional versions of the original functional space.

5. Assumptions for RFE and Their Implications. In this section we discuss the assump- 
tions needed for consistency of RFE for both ERM and SVM. We then discuss validity of these 
assumptions under practical settings. 

5.1. Assumptions. Consider the setting of risk minimization (regularized or non-regularized)
with respect to a given functional space $\mathcal{F}$ (which is typically an RKHS in the case of SVM). Our
main aim is to provide a framework where the modified recursive feature elimination method we
proposed earlier is consistent in finding the correct lower dimensional subspace of the input space,
and the assumptions required for this are:

(A1). Let $J$ be a subset of $\{1, \ldots, d\}$. Let the function $f_{P,\mathcal{F}^J}$ minimize risk within the space $\mathcal{F}^J$
with respect to the probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. Here, we define $\mathcal{F}^{\emptyset} = \mathcal{F}$. We assume
that there exists a non-trivial $J_*$, that is $J_* \neq \emptyset$ and $|J_*| = d_0$, that satisfies the criterion that
for any pair $(d_1, d_2)$ such that $d_1 \le d_2 \le d_0$, $\exists$ $J_{d_1}$ and $J_{d_2}$ such that $J_{d_1} \subseteq J_{d_2} \subseteq J_*$ with
$|J_{d_1}| = d_1$ and $|J_{d_2}| = d_2$ such that $\mathcal{R}^*_{L,P,\mathcal{F}^{J_*}} = \mathcal{R}^*_{L,P,\mathcal{F}^{J_{d_1}}} = \mathcal{R}^*_{L,P,\mathcal{F}^{J_{d_2}}}$.

In other words, Assumption (A1) says that there exists a 'path' from the original input space $\mathcal{X}$ to
the correct lower dimensional space $\mathcal{X}^{J_*}$ in the sense of equality of the minimized risk within the
functional spaces $\mathcal{F}^J$ along this 'path'. So there exists a sequence of indices $\mathcal{J}$ from $J_{start} = \emptyset$ to
$J_{end} = J_*$, where

$\mathcal{J} := \{J_{start} \equiv J_1, J_2, \ldots, J_* \equiv J_{end} : J_1 \subset J_2 \subset \cdots \subset J_{end}, \ |J_i| = |J_{i-1}| + 1\},$

such that $\mathcal{R}^*_{L,P,\mathcal{F}^J}$ is the same for all $J \in \mathcal{J}$. Note that $\mathcal{J}$ may not be unique and there might be
more than one path leading to $\mathcal{X}^{J_*}$. Also note that $J_*$ may not be unique in general, but any one
of them would work for our purpose. So we will assume it to be unique in this paper.

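(A2). Let $\mathcal{J}_1, \mathcal{J}_2, \ldots, \mathcal{J}_N$ be the exhaustive list of such paths from $\mathcal{X}$ to $\mathcal{X}^{J_*}$, and let $\tilde{\mathcal{J}} := \bigcup_{i=1}^{N} \mathcal{J}_i$.

There exists $\epsilon_0 > 0$ such that whenever $J \notin \tilde{\mathcal{J}}$, $\mathcal{R}^*_{L,P,\mathcal{F}^J} \ge \mathcal{R}^*_{L,P,\mathcal{F}^{J_*}} + \epsilon_0$.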

In Section 6 we will show that Assumptions (A1) and (A2) are sufficient for a recursive feature
elimination algorithm like RFE to work (in terms of consistency). Here we try to show the necessity
of Assumption (A1) in order for a well-defined recursive feature elimination algorithm to work.

5.2. Necessity of existence of a path in (A1).

Example 10. Consider the empirical risk minimization framework. Let $\mathcal{X} = [-1,1]^2$ and let
$Y = 0$. Let $X_1 \sim \mathcal{U}$, where $\mathcal{U}$ is some distribution on $[-1, 1]$, and $X_2 = -X_1$. Let the functional
space $\mathcal{F}$ be $\{c(X_1 + X_2), \ c > 0\}$. Let the loss function be the squared error loss, i.e., $L(x, y, f(x)) =
(y - f(x))^2$. By Definition 1, $\mathcal{F}^{\{1\}} = \{cX_2, \ c > 0\}$ and $\mathcal{F}^{\{2\}} = \{cX_1, \ c > 0\}$ and $\mathcal{F}^{\{1,2\}} = \{0\}$. We
see that $\mathcal{R}_{L,P}(f_{P,\mathcal{F}}) = \mathcal{R}_{L,P}(f_{P,\mathcal{F}^{\{1,2\}}}) = 0$ but both $\mathcal{R}_{L,P}(f_{P,\mathcal{F}^{\{1\}}})$ and $\mathcal{R}_{L,P}(f_{P,\mathcal{F}^{\{2\}}}) \neq 0$. Hence
even if the correct low-dimensional functional space has the same minimized risk as
the original functional space, if there does not exist a path going down to that space, the recursive
algorithm will not work. Note that the minimizer of the risk belongs to $\mathcal{F}^{\{1,2\}}$ but there is no path
from $\mathcal{F}$ to $\mathcal{F}^{\{1,2\}}$, in the sense of (A1).

5.3. Necessity of Equality in (A1). It would appear that for the algorithm to work, we do not
necessarily have to work with equalities along the path and that we can relax (A1) to include
inequalities as well. Suppose we redefine (A1) as (A1*), where the equality of minimized risk along
the path is replaced by '$\le$'. So now we assume that minimized risk is not necessarily constant along
the path, but that it does not increase. We show below that under this modified assumption, our
recursive search algorithm might fail to find the correct lower dimensional subspace of the input
space.

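Example 11. Let $Y \sim U(-1, 1)$ and $X \subset \mathbb{R}^3$ such that $Y = X_3 = X_2 + 1 = X_1 - 1$. Let
$\mathcal{F} = \{c_1X_1 + c_2X_2 + c_3X_3, \ c_1, c_2, c_3 \ge 1\}$, and let the loss function be squared error loss. Now by
definition, $\mathcal{F}^{\{1\}} = \{c_2X_2 + c_3X_3, \ c_2, c_3 \ge 1\}$, $\mathcal{F}^{\{2\}} = \{c_1X_1 + c_3X_3, \ c_1, c_3 \ge 1\}$, $\mathcal{F}^{\{3\}} = \{c_2X_2 +
c_1X_1, \ c_1, c_2 \ge 1\}$, $\mathcal{F}^{\{1,2\}} = \{c_3X_3, \ c_3 \ge 1\}$, $\mathcal{F}^{\{1,3\}} = \{c_2X_2, \ c_2 \ge 1\}$, $\mathcal{F}^{\{2,3\}} = \{c_1X_1, \ c_1 \ge 1\}$, and
$\mathcal{F}^{\{1,2,3\}} = \{0\}$.

By simple calculations, we see that $\mathcal{R}^*_{L,P,\mathcal{F}} = \mathcal{R}^*_{L,P,\mathcal{F}^{\{1\}}} = \mathcal{R}^*_{L,P,\mathcal{F}^{\{2\}}} = 4/3$, $\mathcal{R}^*_{L,P,\mathcal{F}^{\{3\}}} =
\mathcal{R}^*_{L,P,\mathcal{F}^{\{1,2,3\}}} = 1/3$, $\mathcal{R}^*_{L,P,\mathcal{F}^{\{1,3\}}} = \mathcal{R}^*_{L,P,\mathcal{F}^{\{2,3\}}} = 1$ and $\mathcal{R}^*_{L,P,\mathcal{F}^{\{1,2\}}} = 0$. Note that the correct
dimensional subspace of the input space is $\mathcal{X}^{\{1,2\}}$ and there exist paths leading to this space via
$\mathcal{X} \to \mathcal{X}^{\{1\}} \to \mathcal{X}^{\{1,2\}}$ since $\mathcal{R}^*_{L,P,\mathcal{F}} = \mathcal{R}^*_{L,P,\mathcal{F}^{\{1\}}} \ge \mathcal{R}^*_{L,P,\mathcal{F}^{\{1,2\}}}$, or via $\mathcal{X} \to \mathcal{X}^{\{2\}} \to \mathcal{X}^{\{1,2\}}$ since
$\mathcal{R}^*_{L,P,\mathcal{F}} = \mathcal{R}^*_{L,P,\mathcal{F}^{\{2\}}} \ge \mathcal{R}^*_{L,P,\mathcal{F}^{\{1,2\}}}$, in the sense of Assumption (A1*). But there also exists the blind
path $\mathcal{X} \to \mathcal{X}^{\{3\}}$, since $\mathcal{R}^*_{L,P,\mathcal{F}} \ge \mathcal{R}^*_{L,P,\mathcal{F}^{\{3\}}}$, which does not lead to the correct subspace. Hence the
recursive search in this case is not guaranteed to lead to the correct subspace.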
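The minimized risks quoted in Example 11 can be checked numerically. The sketch below is illustrative only: it uses the closed form $E[(aY + b)^2] = a^2/3 + b^2$ for $Y \sim U(-1,1)$ and a grid search over coefficients in $[1, 3]$; the function name `min_risk` and the grid are assumptions of this sketch, not part of the paper.

```python
import itertools
import numpy as np

# Each feature is slope * Y + offset: X1 = Y + 1, X2 = Y - 1, X3 = Y.
SLOPE = {1: 1.0, 2: 1.0, 3: 1.0}
OFFSET = {1: 1.0, 2: -1.0, 3: 0.0}

def min_risk(removed, grid=np.linspace(1.0, 3.0, 21)):
    """Grid-minimise E[(f(X) - Y)^2] over coefficients c_j >= 1 of the kept features.

    `removed` is the set J of eliminated feature indices (1-based).
    Uses E[Y] = 0 and E[Y^2] = 1/3 for Y ~ U(-1, 1).
    """
    kept = [j for j in (1, 2, 3) if j not in removed]
    best = np.inf
    for coefs in itertools.product(grid, repeat=len(kept)):
        a = sum(c * SLOPE[j] for c, j in zip(coefs, kept)) - 1.0  # coefficient of Y in f(X) - Y
        b = sum(c * OFFSET[j] for c, j in zip(coefs, kept))       # constant term in f(X) - Y
        best = min(best, a * a / 3.0 + b * b)
    return best

for J in [set(), {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]:
    print(sorted(J), round(min_risk(J), 4))
```

Starting from the full space, removing feature 3 gives the largest drop in minimized risk, so a search that only requires the risk not to increase has no reason to prefer the paths through $\{1\}$ or $\{2\}$, which is the blind-path problem described above.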

Hence equality in (A1) guarantees that the recursive search will never select an important
dimension $j \notin J_*$ for redundancy, because then Assumption (A2) would be violated. Hence the
equality in (A1) will ensure that we follow a path recursively to the correct input space $\mathcal{X}^{J_*}$.

5.4. Validity of the Assumptions in Practical Situations. In this section we discuss the validity 
of the assumptions in 5.1, with respect to practical situations of risk minimization. 

In ERM, our main aim is to find a function $f$ within a class $\mathcal{F}$ that minimizes empirical risk
within that class. The choice of the functional space $\mathcal{F}$ is important as it determines a fine balance
between complexity of the solution (see the discussion in Section 4) on one hand and finding a function that
has risk close to the Bayes risk on the other. Often the spaces we consider for minimization satisfy
properties that make the assumptions in Section 5.1 fairly natural. For SVM, the choice of the
RKHS $H$ is just as important as the choice of a functional class $\mathcal{F}$ in ERM. Again, in most
practical situations, the RKHS $H$ will satisfy some properties that make Assumptions (A1)
and (A2) quite standard conditions for feature elimination.

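5.4.1. Nested spaces in empirical risk minimization. Often the space $\mathcal{F}$ we consider for ERM is
such that for any $J$, $\mathcal{F}^J \subseteq \mathcal{F}$, that is, it admits the nested property. So for any $J_1, J_2 \subseteq \{1, 2, \ldots, d\}$
with $J_1 \subseteq J_2$ we have $\mathcal{F}^{J_2} \subseteq \mathcal{F}^{J_1}$. This means we also have nested inequalities in the form of
$\mathcal{R}^*_{L,P,\mathcal{F}^{J_1}} \le \mathcal{R}^*_{L,P,\mathcal{F}^{J_2}}$ for such $J_1$ and $J_2$. One example is the linear combination space where the
coefficients are allowed to take values in a compact interval containing 0, $\mathcal{F} = \{f(x_1, \ldots, x_d) =
\sum_i a_i x_i : |a_i| \le M, \ M < \infty\}$. In these cases, simple observation shows that Assumption (A1)
translates to saying that there exists a minimizer $f_{P,\mathcal{F}}$ which minimizes infinite-sample risk in $\mathcal{F}$,
satisfying the criterion that $f_{P,\mathcal{F}} \in \mathcal{F}^{J_*}$, which further implies that $f_{P,\mathcal{F}} \in \mathcal{F}^J$ for any $J \subseteq J_*$. Then
the results of Section 5.1 imply that for any $J \subseteq J_*$, $\mathcal{R}^*_{L,P,\mathcal{F}^J} = \mathcal{R}^*_{L,P,\mathcal{F}^{J_*}} = \mathcal{R}^*_{L,P,\mathcal{F}}$ and for any
$J_0 \not\subseteq J_*$, $\mathcal{R}^*_{L,P,\mathcal{F}^{J_0}} \ge \mathcal{R}^*_{L,P,\mathcal{F}^{J_*}} + \epsilon_0$ for $\epsilon_0$ as defined in (A2).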

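Even if the space $\mathcal{F}$ does not satisfy the nested criterion, we can create the nested structure for
feature elimination by considering the unions of these spaces. Noting that $\mathcal{F}^{\emptyset} = \mathcal{F}$, we can create
them as follows:

(10) $\tilde{\mathcal{F}}^J = \bigcup_{J \subseteq J' \subseteq \{1,\ldots,d\}} \mathcal{F}^{J'}.$

It can be seen that the properties of $\mathcal{F}$ and the $\mathcal{F}^J$'s with respect to Lemma 6 are carried forward in
our new definitions too. If $\mathcal{F}$ is dense in $\mathcal{L}_\infty(\mathcal{X})$, by Lemma 6 we have $\mathcal{F}^J$ dense in $\mathcal{L}^J_\infty(\mathcal{X})$ for any
$J$, which implies that $\tilde{\mathcal{F}}^J$ is dense in $\mathcal{L}^J_\infty(\mathcal{X})$. If $\mathcal{F}$ is compact, Lemma 6 implies that $\mathcal{F}^J$ is compact
for any $J$, hence $\tilde{\mathcal{F}}$ and the $\tilde{\mathcal{F}}^J$'s are compact too, since the unions only include finitely many terms.
If we have $e_i(\mathcal{F}, \|\cdot\|_\infty) < \infty$ then again by Lemma 6, $e_i(\tilde{\mathcal{F}}, \|\cdot\|_\infty) < \infty$ and $e_i(\tilde{\mathcal{F}}^J, \|\cdot\|_\infty) < \infty$ for
all $J$.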

5.4.2. Nested RKHSs. Unfortunately, in general, RKHSs need not be nested in each other. In
contrast to the ERM case discussed above, given any RKHS $H$, we cannot create unions of RKHSs to use them in
learning, because unions of RKHSs may not be RKHSs. Here we discuss cases where the naturally
occurring RKHSs are in fact nested within each other. We will see that dot-product kernels actually
have this property. To see this, let us consider a dot-product kernel $k$ such that $k(x, y) = g(\langle x, y \rangle)$
where $\langle \cdot, \cdot \rangle$ is the usual Euclidean inner product. Now let us consider the pre-RKHSs $H_{pre}$ and
$H^J_{pre}$. We show here that $H^J_{pre} \subseteq H_{pre}$, which will imply that $H^J \subseteq H$. For this, take $f \in H^J_{pre}$, which
implies that $f$ can be written as $f(\cdot) = \sum_{i=1}^{n} \alpha_i k^J(\cdot, x_i)$ for $n \in \mathbb{N}$, $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$, $x_1, \ldots, x_n \in \mathcal{X}$.
Hence,

$$f(\cdot) = \sum_{i=1}^{n} \alpha_i k^J(\cdot, x_i) = \sum_{i=1}^{n} \alpha_i k\left(\pi^{J^c}(\cdot), \pi^{J^c}(x_i)\right) = \sum_{i=1}^{n} \alpha_i g\left(\langle \pi^{J^c}(\cdot), \pi^{J^c}(x_i) \rangle\right)$$
$$= \sum_{i=1}^{n} \alpha_i g\left(\langle \cdot, \pi^{J^c}(x_i) \rangle\right) = \sum_{i=1}^{n} \alpha_i k\left(\cdot, \pi^{J^c}(x_i)\right).$$

Noting that $\pi^{J^c}(x_1), \ldots, \pi^{J^c}(x_n) \in \mathcal{X}$, we have that $f \in H_{pre}$. In a similar way, we can show that
for any $J_1 \subseteq J_2$, $H^{J_2} \subseteq H^{J_1}$.
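The key step in the argument is that, for a dot-product kernel, projecting both arguments gives the same value as projecting only the second one. The following sketch checks this numerically for a polynomial kernel and contrasts it with a Gaussian RBF kernel, for which the identity generally fails. The helper names (`project`, `poly_kernel`, `rbf_kernel`) and the specific kernels are assumptions chosen for illustration, with the projection taken to zero out the removed coordinates.

```python
import numpy as np

def project(x, removed):
    """Zero out the coordinates listed in `removed` (0-based indices)."""
    out = np.array(x, dtype=float)
    for j in removed:
        out[j] = 0.0
    return out

def poly_kernel(x, y):
    """Dot-product kernel k(x, y) = g(<x, y>) with g(t) = (1 + t)^2."""
    return (1.0 + np.dot(x, y)) ** 2

def rbf_kernel(x, y):
    """Gaussian RBF kernel with unit width (a radial, not dot-product, kernel)."""
    return np.exp(-np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)
removed = [1, 3]
px, py = project(x, removed), project(y, removed)

# <pi(x), pi(y)> = <x, pi(y)>, so the dot-product kernel values coincide.
dot_gap = abs(poly_kernel(px, py) - poly_kernel(x, py))
# For the RBF kernel, ||x - pi(y)||^2 also contains the removed coordinates of x.
rbf_gap = abs(rbf_kernel(px, py) - rbf_kernel(x, py))
print(dot_gap, rbf_gap)
```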

In the case of RKHSs produced by dot-product kernels (and potentially other RKHSs satisfying
the nested property), the implications of (A1) and (A2) will be the same as discussed before (see
the discussion of nested spaces in Section 5.4.1), and we omit the details here.

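5.4.3. Dense spaces in empirical risk minimization. Another wide class of function spaces we
typically consider in ERM are dense spaces. So if we have that $\mathcal{F}$ is dense in $\mathcal{L}_\infty(\mathcal{X})$, Lemma
6 gives us that $\mathcal{F}^J$ is dense in $\mathcal{L}^J_\infty(\mathcal{X})$ for any $J \subseteq \{1, 2, \ldots, d\}$. First note that for $J_1$ and $J_2$
with $J_1 \subseteq J_2$, we trivially have $\mathcal{L}^{J_2}_\infty(\mathcal{X}) \subseteq \mathcal{L}^{J_1}_\infty(\mathcal{X}) \subseteq \mathcal{L}_\infty(\mathcal{X})$. This also means that for such $J_1$
and $J_2$ we have $\mathcal{R}^*_{L,P,\mathcal{F}^{J_2}} \ge \mathcal{R}^*_{L,P,\mathcal{F}^{J_1}}$. Now 'denseness' does not necessarily imply 'nestedness',
but we have the 'almost nested' property in the sense that for any $g \in \mathcal{F}^{J_2}$, and for any $\epsilon > 0$,
$\exists \, f_\epsilon \in \mathcal{F}^{J_1}$ with $\|f_\epsilon - g\|_\infty < \epsilon$. This means that for $J_*$ as defined in (A1), $\exists \, \{f_n\} \subset \mathcal{F}$ such
that $f_n(x) \to f_{P,\mathcal{F}^{J_*}}(x)$. Hence in terms of (A1), since the loss functions we consider are locally
Lipschitz continuous, by Lemma 2.17 of SC08 we have $\mathcal{R}_{L,P}(f_n) \to \mathcal{R}^*_{L,P,\mathcal{F}^{J_*}} = \mathcal{R}^*_{L,P,\mathcal{F}}$. Also note
that (A1) implies that for any $J \subseteq J_*$, $\mathcal{R}^*_{L,P,\mathcal{F}^J} \ge \mathcal{R}^*_{L,P,\mathcal{F}} = \mathcal{R}^*_{L,P,\mathcal{F}^{J_*}} \ge \mathcal{R}^*_{L,P,\mathcal{F}^J}$, hence for every
$J \subseteq J_*$, $\mathcal{R}^*_{L,P,\mathcal{F}^J} = \mathcal{R}^*_{L,P,\mathcal{F}^{J_*}}$, and then for any $J_0 \not\subseteq J_*$, $\mathcal{R}^*_{L,P,\mathcal{F}^{J_0}} \ge \mathcal{R}^*_{L,P,\mathcal{F}^{J_*}} + \epsilon_0$ for $\epsilon_0$ as defined
in (A2).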

5.4.4. Dense RKHSs. Most of the time the RKHSs we consider for SVMs will also be
dense in $\mathcal{L}_\infty(\mathcal{X})$. Note that the properties of these dense spaces with respect to our Assumptions
(A1) and (A2) as discussed above will remain the same here as well. All universal kernels produce
RKHSs that are dense in $\mathcal{L}_\infty(\mathcal{X})$ with respect to convex, locally Lipschitz continuous losses and




(RK1) and (RK2) imply that all non-trivial radial kernels share this property as well. Hence in most
situations, our Assumptions (A1) and (A2) are a natural way to define a premise that necessitates
feature elimination.

6. Consistency Results for RFE. In this section we show that Algorithms 3 and 4 defined 
in Section 3.1 are consistent in finding the correct feature space under the assumptions in 5.1. We 
will refer often to SC08 for the theory and results developed in their text. 

6.1. Theoretical Results. Before we state the main results of this section, we note down some 
necessary conditions for general ERMs, and for SVMs. We start with ERM: 

(B1). Let $L$ be a convex locally Lipschitz continuous loss function.

(B2). Let $\mathcal{F} \subset \mathcal{L}_\infty(\mathcal{X})$ be non-empty and compact.

(B3). There exists $M > 0$ satisfying $\|f\|_\infty \le M$, $f \in \mathcal{F}$.

(B4). There exists $B > 0$ satisfying $L(x, y, f(x)) \le B$, $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $f \in \mathcal{F}$.

(B5). Assume that for fixed $n \ge 1$, there exist constants $a \ge 1$ and $p \in (0, 1)$ such that
$\mathbb{E}_{D_X \sim P_X^n} e_i(\mathcal{F}, L_\infty(D_X)) \le a \, i^{-\frac{1}{2p}}$, $i \ge 1$; where $\mathbb{E}_{D_X \sim P_X^n}$ is defined as the expectation
with respect to the product measure $P_X^n$ under the assumption that the input data $D_X =
\{X_1, \ldots, X_n\}$ are i.i.d. copies of $X \sim P_X$.

We now define conditions for SVMs: 

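(C1). Let $P$ be a probability measure on $\mathcal{X} \times \mathcal{Y}$ where the input space $\mathcal{X}$ is a valid metric space.
(C2). Let $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a convex locally Lipschitz continuous loss function satisfying
$L(x, y, 0) \le 1$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$.
(C3). Let $H$ be the separable RKHS of a measurable kernel $k$ on $\mathcal{X}$ with $\|k\|_\infty \le 1$.
(C4). Assume that for fixed $n \ge 1$, $\exists$ constants $a \ge 1$ and $p \in (0, 1)$ such that
$\mathbb{E}_{D_X \sim P_X^n} e_i(\mathrm{id} : H \to L_\infty(D_X)) \le a \, i^{-\frac{1}{2p}}$, $i \ge 1$.
(C5). For a sample size of $n$ we choose a $\lambda_n \in [0, 1]$ such that $\lambda_n \to 0$ and $\lim_{n \to \infty} \lambda_n n = \infty$.
(C6). $\exists$ $c > 0$ and $\beta \in (0, 1]$ such that $A_2^J(\lambda) \le c\lambda^\beta$ for any $J$ and for all $\lambda \ge 0$ (where $A_2^{\emptyset}(\lambda) =
A_2(\lambda)$).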

Conditions (B1) - (B5) will be used for proving the results for ERM, while (C2) - (C6) will be
used for proving the results for SVM. Note that these are standard assumptions that are typically
used in the statistical analyses of empirical risk minimization and support vector machines (refer
to Chapters 6 and 7 of SC08).

Also note that the conditions $L(x, y, 0) \le 1$ in (C2), and $\|k\|_\infty \le 1$ for the kernel $k$ in (C3), are
assumed for simplicity and might be too restrictive in some settings, but equivalent conditions like
$L(x, y, 0) \le M$ and $\|k\|_\infty \le k_{sup}$ for constants $M, k_{sup} \ge 1$ are good enough for the proofs and will
result in bounds differing from the ones derived here only up to some constants.

Theorem 12. Assume (B1) - (B5). For $\delta_n^{-1} = O(n^{1/2})$ with $\delta_n \to 0$, we have the following:

1. The Recursive Feature Elimination Algorithm for empirical risk minimization, defined for
$\{\delta_n\}$ given above, will find the correct lower dimensional subspace of the input space with
probability tending to 1.

2. The function chosen by the algorithm achieves the best risk within the original functional 
space J- asymptotically. 

Theorem 13. Assume (C1) - (C6). If we take $\delta_n^{-1} = O(n^{\frac{\beta}{2\beta+1}})$ with $\delta_n \to 0$, then we have
the following:

1. The Recursive Feature Elimination Algorithm for support vector machines, defined for $\{\delta_n\}$
given above, will find the correct lower dimensional subspace of the input space with probability
tending to 1.

2. The function chosen by the algorithm achieves the best risk within the original RKHS H 
asymptotically. 

We will give a detailed proof of Theorem 12 later and it will be seen that proof of Theorem 13 
will be similar. But first, we provide a few relevant results which we will need for proving these 
main results. We start off with the following lemma. 

Lemma 14. Let $(\mathcal{F}, \|\cdot\|_{\mathcal{F}})$ be a separable functional space, such that the metric $\|\cdot\|_{\mathcal{F}}$ dominates
pointwise convergence. Also we assume $\|f\|_{\mathcal{F}} \le C$ for some $C < \infty$ for all $f \in \mathcal{F}$.
Let $L$ be a convex, locally Lipschitz loss function such that $L(x, y, f(x)) \le B$ for some $B < \infty$
for all $f \in \mathcal{F}$. Also assume that for fixed $n \ge 1$, $\exists$ constants $a \ge 1$ and $p \in (0, 1)$ such that
$\mathbb{E}_{D_X \sim P_X^n} e_i(\mathcal{F}, L_\infty(D_X)) \le a \, i^{-\frac{1}{2p}}$, $i \ge 1$. Then, we have with probability greater than or equal to
$1 - e^{-\tau}$,

$$\sup_{f \in \mathcal{F}} \left| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \right| \le 2B\sqrt{\frac{2\tau}{n}} + \frac{10B\tau}{3n}
+ 4 \max\left\{ C_1(p) c_L(C)^p a^p B^{1-p} n^{-\frac{1}{2}}, \ C_2(p) c_L(C)^{\frac{2p}{1+p}} a^{\frac{2p}{1+p}} B^{\frac{1-p}{1+p}} n^{-\frac{1}{1+p}} \right\}.$$

See Appendix A.3 for a proof. We now provide the necessary results for ERM.

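Proposition 15. Assume conditions (B1) - (B5). For all measurable ERMs and all $\epsilon > 0$,
$\tau > 0$, and $n \ge 1$, and for $J_1, J_2 \in \tilde{\mathcal{J}}$ such that $J_1 \subseteq J_2 \subseteq J_*$, we have

$$P^n\left( D \in (\mathcal{X} \times \mathcal{Y})^n : \left| \mathcal{R}_{L,D}(f_{D,\mathcal{F}^{J_2}}) - \mathcal{R}_{L,D}(f_{D,\mathcal{F}^{J_1}}) \right| \le 12B\sqrt{\frac{2\tau}{n}} + \frac{20B\tau}{n} + 24K_1 \left( \frac{a^{2p}}{n} \right)^{\frac{1}{2}} \right) \ge 1 - 2e^{-\tau},$$

where $K_1 := \max\left\{ B/4, \ C_1(p) c_L(C)^p B^{1-p}, \ C_2(p) c_L(C)^{\frac{2p}{1+p}} B^{\frac{1-p}{1+p}} \right\}$.

The proof can be found in Appendix A.4.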

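Note that compactness of $\mathcal{F}$ implies compactness of $\mathcal{F}^J$ for any $J$ by Lemma 6. Compactness is
important because, along with continuity of $\mathcal{R}_{L,D} : \mathcal{L}_\infty(\mathcal{X}) \to [0, \infty)$, it ensures the existence of an
empirical risk minimizer. Also, compactness of $\mathcal{F}^J$ implies that it is a closed and separable subset
of $\mathcal{L}_\infty(\mathcal{X})$ for each $J$. Hence Lemma 6.17 of SC08 shows that there exists a measurable ERM for
both classes $\mathcal{F}^{J_1}$ and $\mathcal{F}^{J_2}$. Now note that for any $J$, the quantities $\mathcal{R}_{L,D}(f_{D,\mathcal{F}^J})$, $\mathcal{R}^*_{L,P,\mathcal{F}^J}$ and
$\mathcal{R}_{L,P}(f_{D,\mathcal{F}^J})$ are all less than or equal to $B$. Note also that,

$$\left| \mathcal{R}_{L,D}(f_{D,\mathcal{F}^J}) - \mathcal{R}^*_{L,P,\mathcal{F}^J} \right| \le \left| \mathcal{R}_{L,D}(f_{D,\mathcal{F}^J}) - \mathcal{R}_{L,P}(f_{D,\mathcal{F}^J}) \right| + \left| \mathcal{R}_{L,P}(f_{D,\mathcal{F}^J}) - \mathcal{R}^*_{L,P,\mathcal{F}^J} \right|$$
$$\le \sup_{f \in \mathcal{F}^J} \left| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \right| + 2 \sup_{f \in \mathcal{F}^J} \left| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \right|.$$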
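The factor 2 in the second term comes from the standard decomposition used for empirical risk minimizers. As a sketch, write $f_D := f_{D,\mathcal{F}^J}$ and assume, only for this illustration, that the infimum $\mathcal{R}^*_{L,P,\mathcal{F}^J}$ is attained at some $f_P \in \mathcal{F}^J$ (the symbols $f_D$ and $f_P$ are shorthand introduced here):

```latex
\begin{align*}
\mathcal{R}_{L,P}(f_D) - \mathcal{R}^*_{L,P,\mathcal{F}^J}
  &= \bigl[\mathcal{R}_{L,P}(f_D) - \mathcal{R}_{L,D}(f_D)\bigr]
   + \bigl[\mathcal{R}_{L,D}(f_D) - \mathcal{R}_{L,D}(f_P)\bigr]
   + \bigl[\mathcal{R}_{L,D}(f_P) - \mathcal{R}_{L,P}(f_P)\bigr] \\
  &\le 2 \sup_{f \in \mathcal{F}^J}
     \bigl|\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)\bigr| .
\end{align*}
```

The middle bracket is non-positive because $f_D$ minimizes the empirical risk over $\mathcal{F}^J$, and each of the outer brackets is bounded by the supremum.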

Consequently we obtain the following two corollaries: 

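Corollary 16. Assume the conditions of Proposition 15. For any $J$ and all measurable ERMs
and all $\epsilon > 0$, $\tau > 0$, and $n \ge 1$, we have that

$$P^n\left( D \in (\mathcal{X} \times \mathcal{Y})^n : \left| \mathcal{R}_{L,D}(f_{D,\mathcal{F}^J}) - \mathcal{R}^*_{L,P,\mathcal{F}^J} \right| \le 6B\sqrt{\frac{2\tau}{n}} + \frac{10B\tau}{n} + 12K_1 \left( \frac{a^{2p}}{n} \right)^{\frac{1}{2}} \right) \ge 1 - e^{-\tau},$$

where $K_1$ is as before. Additionally, if $J \in \tilde{\mathcal{J}}$, we can replace $\mathcal{R}^*_{L,P,\mathcal{F}^J}$ in the above inequality by
$\mathcal{R}^*_{L,P,\mathcal{F}}$.

Corollary 17. Oracle Inequality for ERM: Assume the conditions of Proposition 15.
For any $J$ and all $\epsilon > 0$, $\tau > 0$, and $n \ge 1$, we have with $P^n$-probability $\ge 1 - e^{-\tau}$,

where $K_1$ is as before.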

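We now present similar results for SVMs. As before, we start with the main proposition:

Proposition 18. Assume conditions (C1) - (C4). For fixed $\lambda > 0$, $\epsilon > 0$, $\tau > 0$, and $n \ge 1$,
and for $J_1, J_2 \in \tilde{\mathcal{J}}$ such that $J_1 \subseteq J_2 \subseteq J_*$, we have with probability $P^n$ not less than $1 - 2e^{-\tau}$,

$$\left| \lambda \|f_{D,\lambda,H^{J_2}}\|^2_{H^{J_2}} + \mathcal{R}_{L,D}(f_{D,\lambda,H^{J_2}}) - \lambda \|f_{D,\lambda,H^{J_1}}\|^2_{H^{J_1}} - \mathcal{R}_{L,D}(f_{D,\lambda,H^{J_1}}) \right|$$
$$\le A_2^{J_1}(\lambda) + A_2^{J_2}(\lambda) + 12B\sqrt{\frac{2\tau}{n}} + \frac{20B\tau}{n} + 24K_2 B^{1-p} \left( \frac{a^{2p}}{\lambda^p n} \right)^{\frac{1}{2}},$$

where $A_2^{J_1}(\lambda)$ and $A_2^{J_2}(\lambda)$ are the approximation errors for the two separate RKHS classes $H^{J_1}$
and $H^{J_2}$, $B := c_L(\lambda^{-1/2})\lambda^{-1/2} + 1$, and $K_2 := \max\left\{ B^p/4, \ C_1(p) c_L(\lambda^{-\frac{1}{2}})^p, \ C_2(p) c_L(\lambda^{-\frac{1}{2}})^{\frac{2p}{1+p}} \right\}$ is a
constant depending only on $B$, $p$ and the Lipschitz constant $c_L(\lambda^{-1/2})$.

See Appendix A.5 for a detailed proof of Proposition 18.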

Note that since P > 1 and -fC > BP/4, we have that if a^^ > A^n, 



(12) 



<3B< 12KB^^P H^ • 
\XPnJ 

Similarly, since B > 1 and K > BP/4, we have for a'^P > \Pn, 

M\fDXHA\HJ '^'^L,P {fD,\,Hj) -'^L,P,HJ - ^ ||/d,A,HJ 11//-' +'^L,D [foXH') +'^L,P [foXH^) 

7^L,p(0) + IZl^p Udxh^) < 1 + ^ < 2P 



< 



(13) 



< SKB^-P ( ^ 



a2p\-2 



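Now note that for any $J$, we have

$$\lambda \|f_{D,\lambda,H^J}\|^2_{H^J} + \mathcal{R}_{L,D}(f_{D,\lambda,H^J}) - \mathcal{R}^*_{L,P,H^J}$$
$$\le \lambda \|f_{D,\lambda,H^J}\|^2_{H^J} + \mathcal{R}_{L,P}(f_{D,\lambda,H^J}) - \mathcal{R}^*_{L,P,H^J} + \left| \mathcal{R}_{L,P}(f_{D,\lambda,H^J}) - \mathcal{R}_{L,D}(f_{D,\lambda,H^J}) \right|$$
(14) $$\le A_2^J(\lambda) + 2 \sup \left| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \right| + \sup \left| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \right|.$$

Consequently, we obtain two corollaries for SVMs, similar to Corollaries 16 and 17:

Corollary 19. Assume the conditions of Proposition 18. For any $J$ and all $\epsilon > 0$, $\tau > 0$, and
$n \ge 1$, we have with $P^n$-probability $\ge 1 - e^{-\tau}$,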

9 /2r r / a'^P \ ^ 

where $K_2$ is as before. Additionally, if $J \in \tilde{\mathcal{J}}$, we can replace $\mathcal{R}^*_{L,P,H^J}$ in the above inequality by $\mathcal{R}^*_{L,P,H}$.

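Corollary 20. Oracle Inequality for SVM: Assume the conditions of Proposition 18.
For any $J$ and all $\epsilon > 0$, $\tau > 0$, and $n \ge 1$, we have with $P^n$-probability $\ge 1 - e^{-\tau}$,

$$\lambda \|f_{D,\lambda,H^J}\|^2_{H^J} + \mathcal{R}_{L,P}(f_{D,\lambda,H^J}) - \mathcal{R}^*_{L,P,H^J} \le A_2^J(\lambda) + 4B\sqrt{\frac{2\tau}{n}} + \frac{20B\tau}{3n} + 8K_2 B^{1-p} \left( \frac{a^{2p}}{\lambda^p n} \right)^{\frac{1}{2}},$$

where $K_2$ is as before.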

Proposition 15 and Corollaries 16, 17 are intended for ERM and will be used in proving Lemma 
21 which in turn will aid in proving Theorem 12. Similarly, Proposition 18 and Corollaries 19, 20 
developed for SVM will be used to prove the following Lemma 22, that will set up the premise for 
proving Theorem 13. 

We now provide Lemma 21 for ERM: 

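Lemma 21. Assume the conditions of Proposition 15. Then the following statements hold:

i. For $J_1, J_2 \in \tilde{\mathcal{J}}$ and $J_1 \subseteq J_2 \subseteq J_*$, $\exists$ $(\{\epsilon_n\} > 0) \to 0$ such that we have with probability
greater than $1 - 2e^{-\tau}$, $\mathcal{R}_{L,D}(f_{D,\mathcal{F}^{J_2}}) \le \mathcal{R}_{L,D}(f_{D,\mathcal{F}^{J_1}}) + \epsilon_n$.
ii. For $J_1 \in \tilde{\mathcal{J}}$, $J_2 \notin \tilde{\mathcal{J}}$ such that $J_1 \subseteq J_2$, $\exists$ $\{\epsilon_n\} > 0$ and $\tilde{\epsilon}_n \to \epsilon_0 > 0$, such that we have with
probability greater than $1 - 2e^{-\tau}$, $\mathcal{R}_{L,D}(f_{D,\mathcal{F}^{J_2}}) \ge \mathcal{R}_{L,D}(f_{D,\mathcal{F}^{J_1}}) + \tilde{\epsilon}_n$.
iii. Oracle Property for RFE in ERM: For a given $J \subseteq \{1, \ldots, d\}$ the infinite-sample risk
of the function $f_{D,\mathcal{F}^J}$, $\mathcal{R}_{L,P}(f_{D,\mathcal{F}^J})$, converges in measure to $\mathcal{R}^*_{L,P,\mathcal{F}}$ (and hence to $\mathcal{R}^*_{L,P}$ if
$\mathcal{F}$ is dense in $\mathcal{L}_\infty(\mathcal{X})$) iff $J \in \tilde{\mathcal{J}}$.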

Now we provide a similar version of the above lemma for SVM. We further assume that the
regularization constant $\lambda_n$ converges to 0 and assume the rate for such convergence as given in
(C5). To explicitly establish rates for our algorithm we further assume that the bound on the
approximation error $A_2(\lambda)$ is as given in (C6).

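Lemma 22. Assume the conditions of Proposition 18. Also assume (C5) and (C6). Then the
following statements hold:

i. For $J_1, J_2 \in \tilde{\mathcal{J}}$ such that $J_1 \subseteq J_2 \subseteq J_*$, $\exists$ $(\{\epsilon_n\} > 0) \to 0$ such that we have with probability
greater than $1 - 2e^{-\tau}$,

ii. For $J_1 \in \tilde{\mathcal{J}}$ and $J_2 \notin \tilde{\mathcal{J}}$ and for $J_1 \subseteq J_2$, $\exists$ $\{\epsilon_n\} > 0$ and $\tilde{\epsilon}_n \to \epsilon_0 > 0$, such that we have
with probability greater than $1 - 2e^{-\tau}$,

iii. Oracle Property for RFE in SVM: The infinite-sample regularized risk for the empirical
solution $f_{D,\lambda_n,H^J}$, $\lambda_n \|f_{D,\lambda_n,H^J}\|^2_{H^J} + \mathcal{R}_{L,P}(f_{D,\lambda_n,H^J})$, converges in measure to $\mathcal{R}^*_{L,P,H}$
(and hence to $\mathcal{R}^*_{L,P}$ if the RKHS $H$ is dense in $\mathcal{L}_\infty(\mathcal{X})$) iff $J \in \tilde{\mathcal{J}}$.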

The proof of Lemma 22 is given in Appendix A.6. Lemma 21 follows similarly and hence we
only discuss briefly the results that we obtain from Lemma 21 in Appendix A.7.

We are now ready to prove Theorems 12 and 13. Since the proofs are similar we only provide 
the proof for Theorem 12 and discuss the changes required for the proof of Theorem 13. 

6.2. Proof of Theorem 12. 

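Proof. (1) Let $\mathcal{X}^{J_*}$ be the correct input space and $J_*$ be the correct set of dimensions to
be removed with $|J_*| = d_0$. To prove the first part of Theorem 12, we show that, starting with
the input space $\mathcal{X}$, the probability that we reach the space $\mathcal{X}^{J_*}$ is 1 asymptotically. First let us
assume that there exists only one correct 'path' from $\mathcal{X}$ to $\mathcal{X}^{J_*}$. Let $\mathcal{J}^0$ be that correct path and
$\mathcal{J}^0 = \{J_0^0 \equiv \emptyset, J_1^0, \ldots, J_{d_0}^0 \equiv J_*\}$. For the sake of notational ease, we use the convention that
$\mathcal{R}_{L,D}(f_{D,\mathcal{F}^J}) = \mathcal{R}^*_{L,D,\mathcal{F}^J}$.

So from (32) in Appendix A.7 we have $\mathcal{R}^*_{L,D,\mathcal{F}^{J^0_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} \le \epsilon_n$ with probability at least
$1 - 2e^{-\tau}$, where $\epsilon_n = 12B\sqrt{2\tau} \, n^{-1/2} + 20B\tau n^{-1} + 24K_1 a^p n^{-1/2}$. Now let $J_{i+1} \neq J^0_{i+1}$ be any other $J$
such that $J^0_i \subset J_{i+1}$ with $|J_{i+1}| = |J^0_i| + 1$; then from the proof of Lemma 21 in Appendix A.7 we
have $\mathcal{R}^*_{L,D,\mathcal{F}^{J_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} \ge \epsilon_0 - \epsilon_n$, with probability at least $1 - 2e^{-\tau}$. Now if we choose $\tau = o(n)$
with $\tau \to \infty$, then $\delta_n = \epsilon_n$ satisfies the conditions that $\delta_n^{-1} = O(n^{1/2})$ with $\delta_n \to 0$. Now since $\epsilon_0$
is a fixed constant, $\exists \, N_{\epsilon_0} > 0$ such that $\forall n > N_{\epsilon_0}$, $2\delta_n \le \epsilon_0$. Hence, without loss of generality, we
assume throughout the remainder of the proof that $n > N_{\epsilon_0}$. Then we have the condition that
$\mathcal{R}^*_{L,D,\mathcal{F}^{J_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} \ge \delta_n$ with probability at least $1 - 2e^{-\tau}$.
Then,

$$P(\text{'RFE finds the correct dimensions'})$$
$$\ge P(\text{'RFE follows the path } \mathcal{J}^0 \text{ to the correct dimension space'})$$
$$= P\left(J_0 := J_0^0, J_1 := J_1^0, \ldots, J_{d_0} := J_{d_0}^0, J_{d_0+1} := \emptyset\right)$$
$$= P(J_0 := J_0^0) \, P(J_1 := J_1^0 \mid J_0^0) \cdots P(J_{d_0} := J_{d_0}^0 \mid J_0^0, \ldots, J_{d_0-1}^0) \, P(J_{d_0+1} := \emptyset \mid J_0^0, \ldots, J_{d_0}^0),$$

where '$J_{d_0+1} := \emptyset$' means the algorithm stops at that step. Note that $P(J_0 := J_0^0) = 1$ and then
observe,

$$P(J_{i+1} := J^0_{i+1} \mid J_0^0, \ldots, J_i^0)$$
$$= P(J_{i+1} := J^0_{i+1} \mid J_i^0) \quad (\because \{J_0^0, \ldots, J_{i-1}^0\} \text{ have already been removed from the model})$$
$$= P\left( \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} \le \delta_n, \ \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} \le \mathcal{R}^*_{L,D,\mathcal{F}^{J_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} \ \forall J_{i+1} \neq J^0_{i+1} \right)$$
$$\ge P\left( \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} \le \delta_n, \ \delta_n < \mathcal{R}^*_{L,D,\mathcal{F}^{J_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} \ \forall J_{i+1} \neq J^0_{i+1} \right)$$
$$\ge 1 - P\left( \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} > \delta_n \right) - \sum_{J_{i+1} \neq J^0_{i+1}} P\left( \mathcal{R}^*_{L,D,\mathcal{F}^{J_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} \le \delta_n \right)$$
$$\ge 1 - 2e^{-\tau} - 2(d - i - 1)e^{-\tau} = 1 - 2(d - i)e^{-\tau}.$$

Also see that

$$P(J_{d_0+1} := \emptyset \mid J_0^0, \ldots, J_{d_0}^0) = P\left( \bigcap_{J_{d_0+1}} \left\{ \mathcal{R}^*_{L,D,\mathcal{F}^{J_{d_0+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_{d_0}}} > \delta_n \right\} \right) \ge 1 - 2(d - d_0)e^{-\tau}.$$

Hence,

$$P(\text{'RFE finds the correct dimensions'}) \ge \prod_{i=0}^{d_0} \left( 1 - 2(d - i)e^{-\tau} \right).$$

Now for $\tau = o(n)$ with $\tau \to \infty$, $P(\text{'RFE finds the correct dimensions'}) \to 1$ as $n \to \infty$.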
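To get a feel for how quickly the product bound above approaches 1, it can be evaluated for a growing sequence $\tau = \tau_n$. The sketch below is illustrative only: the function name, the values of $d$ and $d_0$, and the choice $\tau_n = \sqrt{n}$ are assumptions made here (any $\tau_n \to \infty$ with $\tau_n = o(n)$ would do), and factors are clamped at zero since a negative factor carries no information.

```python
import math

def rfe_success_lower_bound(d, d0, tau):
    """Evaluate prod_{i=0}^{d0} (1 - 2 (d - i) e^{-tau}), clamping each factor at 0."""
    bound = 1.0
    for i in range(d0 + 1):
        factor = 1.0 - 2.0 * (d - i) * math.exp(-tau)
        bound *= max(factor, 0.0)  # clamping is an addition of this sketch
    return bound

for n in (10**2, 10**4, 10**6):
    tau = math.sqrt(n)  # hypothetical choice: tau -> infinity and tau = o(n)
    print(n, rfe_success_lower_bound(d=20, d0=5, tau=tau))
```

For small $\tau$ the bound is vacuous (it evaluates to 0 here), which reflects that the result is asymptotic in $n$.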

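Now let us prove the same assertion for the case when there is more than one correct 'path' from
$\mathcal{X}$ to $\mathcal{X}^{J_*}$. Let $\mathcal{J}_1, \ldots, \mathcal{J}_N$ be an enumeration of all possible such paths. Define 'C-sets' for a $J_i$
(where the index $i$ denotes the $i$-th cycle of RFE) as $CS(J_i) := \{J_{i+1} : J_i, J_{i+1} \in \mathcal{J}_k \text{ for some } k\}$. Now,

$$P(\text{'RFE finds the correct dimensions'})$$
$$\ge P\left(J_0 := J_0^0, \ J_1 := J_1^0 \in CS(J_0^0), \ \ldots, \ J_{d_0} := J_{d_0}^0 \in CS(J_{d_0-1}^0), \ J_{d_0+1} := \emptyset\right)$$
$$= P(J_0 := J_0^0) \, P(J_1 := J_1^0 \in CS(J_0^0) \mid J_0^0) \cdots P(J_{d_0} := J_{d_0}^0 \in CS(J_{d_0-1}^0) \mid J_{d_0-1}^0) \, P(J_{d_0+1} := \emptyset \mid J_{d_0}^0).$$

Again as before, $P(J_0 := J_0^0) = 1$ and $P(J_{d_0+1} := \emptyset \mid J_{d_0}^0) \ge 1 - 2(d - d_0)e^{-\tau}$. Now note,

$$P\left(J_{i+1} := J^0_{i+1} \in CS(J_i^0) \mid J_i^0\right)$$
$$\ge P\left( \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} \le \delta_n \ \forall J^0_{i+1} \in CS(J_i^0), \ \ \delta_n < \mathcal{R}^*_{L,D,\mathcal{F}^{J_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} \ \forall J_{i+1} \notin CS(J_i^0) \right)$$
$$\ge 1 - \sum_{J^0_{i+1} \in CS(J_i^0)} P\left( \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} > \delta_n \right) - \sum_{J_{i+1} \notin CS(J_i^0)} P\left( \mathcal{R}^*_{L,D,\mathcal{F}^{J_{i+1}}} - \mathcal{R}^*_{L,D,\mathcal{F}^{J^0_i}} \le \delta_n \right)$$
$$\ge 1 - 2|CS(J_i^0)|e^{-\tau} - 2|CS(J_i^0)^c|e^{-\tau} = 1 - 2(d - i)e^{-\tau},$$

since $|CS(J_i^0)| + |CS(J_i^0)^c| = d - i$. Hence again we have that

$$P(\text{'RFE finds the correct dimensions'}) \ge \prod_{i=0}^{d_0} \left( 1 - 2(d - i)e^{-\tau} \right).$$

Hence for $\tau = o(n)$ with $\tau \to \infty$, $P(\text{'RFE finds the correct dimensions'}) \to 1$ as $n \to \infty$.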

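(2) To prove the second part of Theorem 12 just observe that if $J_{end}$ is the last cycle of the
algorithm in RFE, then from (34) in Appendix A.7 we have that

$$P\left( \left| \mathcal{R}_{L,P}(f_{D,\mathcal{F}^{J_{end}}}) - \mathcal{R}^*_{L,P,\mathcal{F}} \right| \le \delta_n \right) = P\left( \left| \mathcal{R}_{L,P}(f_{D,\mathcal{F}^{J_*}}) - \mathcal{R}^*_{L,P,\mathcal{F}} \right| \le \delta_n \right) P(J_{end} = J_*)$$
$$+ \ P\left( \left| \mathcal{R}_{L,P}(f_{D,\mathcal{F}^{J_{end}}}) - \mathcal{R}^*_{L,P,\mathcal{F}} \right| \le \delta_n \,\middle|\, J_{end} \neq J_* \right) P(J_{end} \neq J_*)$$
$$\ge P\left( \left| \mathcal{R}_{L,P}(f_{D,\mathcal{F}^{J_*}}) - \mathcal{R}^*_{L,P,\mathcal{F}} \right| \le \delta_n \right) P(J_{end} = J_*)$$
$$\ge (1 - e^{-\tau}) \prod_{i=0}^{d_0} \left( 1 - 2(d - i)e^{-\tau} \right).$$

So for $\tau = o(n)$ with $\tau \to \infty$, $P\left( \left| \mathcal{R}_{L,P}(f_{D,\mathcal{F}^{J_{end}}}) - \mathcal{R}^*_{L,P,\mathcal{F}} \right| \le \delta_n \right) \to 1$ as $n \to \infty$. $\square$

Note: Although (34) in Appendix A.7 was asserted for $\eta_n$, it is of the same order as $\delta_n$ and they
only differ by constants. We also have $\eta_n \le \delta_n$, so the proof for the second part of the theorem
will hold true for $\delta_n$.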



6.3. Proof of Theorem 13.

Proof. The proof of Theorem 13 is similar to the proof of Theorem 12. Hence we omit the
details and note only the needed changes. The main differences include the criterion of elimination
for the algorithm, the choice of the stopping condition $\delta_n$, and the choice of $\tau$.

Note that here we use the following two assertions. First, for $J^0_{i+1} \in CS(J^0_i)$, we have from the
proof of (i) in Appendix A.6 that

$$\lambda_n \|f_{D,\lambda_n,H^{J^0_{i+1}}}\|^2_{H^{J^0_{i+1}}} + \mathcal{R}_{L,D}(f_{D,\lambda_n,H^{J^0_{i+1}}}) - \lambda_n \|f_{D,\lambda_n,H^{J^0_i}}\|^2_{H^{J^0_i}} - \mathcal{R}_{L,D}(f_{D,\lambda_n,H^{J^0_i}}) \le \epsilon_n$$

occurs with probability at least $1 - 2e^{-\tau}$. Second, for $J_{i+1} \notin CS(J^0_i)$
being any other $J$ such that $J^0_i \subset J_{i+1}$ with $|J_{i+1}| = |J^0_i| + 1$, we also have from (27) and
(28) in Appendix A.6 that

$$\lambda_n \|f_{D,\lambda_n,H^{J_{i+1}}}\|^2_{H^{J_{i+1}}} + \mathcal{R}_{L,D}(f_{D,\lambda_n,H^{J_{i+1}}}) - \lambda_n \|f_{D,\lambda_n,H^{J^0_i}}\|^2_{H^{J^0_i}} - \mathcal{R}_{L,D}(f_{D,\lambda_n,H^{J^0_i}}) \ge \epsilon_0 - \epsilon_n$$

occurs with probability at least $1 - 2e^{-\tau}$, for
$\epsilon_n = (2c + 24\sqrt{2\tau} + 48K_2 a^p) \, n^{-\frac{\beta}{2\beta+1}} + 40\tau n^{-\frac{4\beta+1}{2(2\beta+1)}}$.

Now if we choose $\tau = o(n^{\frac{2\beta}{2\beta+1}})$ with $\tau \to \infty$, then $\delta_n = \epsilon_n$ satisfies the above inequalities along
with the conditions that $\delta_n^{-1} = O(n^{\frac{\beta}{2\beta+1}})$ with $\delta_n \to 0$.

Using similar steps as in the proof of Theorem 12, we have our assertion. $\square$

7. Case Studies. In this section we show the validity of our results in many practical cases 
of risk minimization by discussing the results in some known settings. 

7.1. CASE STUDY 1: Feature Elimination in Linear Regression. In this case study we present 
our results for the simple setting of linear regression. This example shows that the consistency 
results achieved in this paper can be applied to many different situations ranging from simple to 
complex risk minimization problems and in some cases can substantiate known techniques that 
are in practice in such contexts for feature elimination. Linear regression is one of the most fre- 
quently used statistical techniques for data analysis. It is also a simple example of an empirical 
risk minimization problem. 

In a linear regression model, we assume that the functional relationship can be expressed as
$y = \langle \alpha, x \rangle + b_0$, where $\langle \alpha, x \rangle$ denotes the Euclidean inner product of vectors $\alpha$ and $x$ and $b_0$ is
the bias. The prediction quality of this model can be measured by the squared-error loss function
$L_{LS}$, given as $L_{LS}(x, y, f(x)) = (f(x) - y)^2$, and our goal is to find linear weights $\alpha$ and $b_0$ for the
observed data $D$ that minimize the empirical risk. We assume that the input space $\mathcal{X} \subset B \subset \mathbb{R}^d$.
We further assume that $\mathcal{Y} \subset \mathbb{R}$ is a closed set. The functional space $\mathcal{F}_{lin}$ is given by $\mathcal{F}_{lin} = \{f_{\alpha,b_0} :
f_{\alpha,b_0}(x) = \langle \alpha, x \rangle + b_0, \ (\alpha, b_0) \in \mathbb{R}^{d+1}, \ \|(\alpha, b_0)\|_\infty \le M, \text{ for some } M < \infty\}$. We can now observe
that the Assumptions (B1) - (B5), which are required for the consistency results, are satisfied for
this problem. Note that (B1) is satisfied since the least squares loss function $L_{LS}$ is convex, and
as observed in SC08, $L_{LS}$ is locally Lipschitz continuous when $\mathcal{Y}$ is compact. (B3) and (B4) follow
trivially under the premise that $\mathcal{X} \subset B \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}$ is a closed set and the fact that for
some $M < \infty$, $\|(\alpha, b_0)\|_\infty \le M$ for any function $f_{\alpha,b_0}$ within the linear functional class $\mathcal{F}_{lin}$. Since
$\mathcal{F}_{lin}$ is non-empty, (B2) follows from (B3). Assumption (B5) imposes an exponential bound on the
average entropy number. Many analyses have been done on covering numbers for linear function
classes (see Zhang and Bartlett, 2002; Williamson, 2000) and under quite general assumptions it
was proved that exponential bounds can be imposed on the $\epsilon$-entropies of such functional classes,
which is actually stronger than our Assumption (B5) (refer to Theorems 4 and 5 in Zhang and
Bartlett (2002)).

Thus the RFE procedure presented in this paper translates in the linear regression case into a
non-parametric backward selection method based on the value of the 'average sum of squares of error', or
$E^2/n$. Indeed, the average empirical risk of the estimator $f(x)$ for the sample is exactly $E^2/n$. In a
non-parametric setup, under restrictive distributional assumptions on the output vector $\mathcal{Y}$, the idea
of using penalized versions of $\log E^2$ like AIC, AICc or BIC is a well-accepted ad hoc methodology
for model selection (and hence feature elimination), although it is not always trivial to know which
penalty should be used in a given situation, or which is best in that regard. This paper produces
a theoretical basis for using the non-penalized criterion $E^2/n$ as a tool for feature elimination
in linear regression. Suppose we start with a set of covariates $X = \{X_1, \ldots, X_d\}$ and let us assume
without loss of generality that the covariates are pre-ordered on the basis of their importance. Then
Assumptions (A1) and (A2) can be interpreted as claiming the existence of an $r \in \{1, 2, \ldots, d\}$
such that the following null hypothesis is true: $H_0 : \{\alpha_d = \cdots = \alpha_{r+1} = 0, \ \alpha_r, \ldots, \alpha_1 \neq 0\}$. So this
paper establishes consistency for RFE based on the criterion $E^2/n$ and a pre-determined stopping
rule in finding the correct feature space $X_0 = \{X_1, \ldots, X_r\}$ under this null hypothesis $H_0$.
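The backward selection described above can be sketched in a few lines. This is an illustrative approximation, not the algorithm of Section 3.1: it fits unconstrained least squares (ignoring the bound $M$ on the coefficients), removes at each cycle the feature whose removal increases the average squared error the least, and stops once that smallest increase exceeds a threshold `delta`. The function names, the simulated data, and the choice `delta = n ** -0.25` are all assumptions of this sketch.

```python
import numpy as np

def empirical_risk(X, y, kept):
    """Average squared error of the least-squares fit (with intercept) on columns `kept`."""
    cols = np.asarray(kept, dtype=int)
    A = np.column_stack([X[:, cols], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid) / len(y)

def rfe_linear(X, y, delta):
    """Backward elimination: drop the least harmful feature while the risk increase is <= delta."""
    kept = list(range(X.shape[1]))
    risk = empirical_risk(X, y, kept)
    while kept:
        # Risk after removing each remaining feature in turn.
        trials = [(empirical_risk(X, y, [k for k in kept if k != j]), j) for j in kept]
        best_risk, j = min(trials)
        if best_risk - risk > delta:
            break  # every removal costs more than delta: stop
        kept.remove(j)
        risk = best_risk
    return kept

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-1.0, 1.0, size=(n, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=n)  # features 2, 3, 4 are noise
selected = rfe_linear(X, y, delta=n ** -0.25)
print(selected)
```

In this simulated setting the increase in average squared error from dropping a noise feature is far below the threshold, while dropping either signal feature raises it well above, so the loop stops with the two signal features remaining.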

7.2. CASE STUDY 2: Support Vector Machines with a Gaussian RBF Kernel. Here we provide
a brief review of the application of RFE in the classic SVM premise for classification using a Gaussian
RBF kernel. Assume that $\mathcal{Y} = \{1, -1\}$. We want to find a function $f : \mathcal{X} \to \{1, -1\}$ such that
for almost every $x \in \mathcal{X}$, $P(f(x) = Y \mid X = x) \geq 1/2$. In this case, the desired function is the Bayes
decision function $f^*_{L_{BC},P}$ with respect to the loss function $L_{BC}(x, y, f(x)) = \mathbf{1}\{y \cdot \mathrm{sign}(f(x)) \neq 1\}$. In
practice, since $L_{BC}$ is non-convex, it is usually replaced by the hinge loss function $L_{HL}(x, y, f(x)) =
\max\{0, 1 - y f(x)\}$. For SVMs with a Gaussian RBF kernel, we minimize the regularized empirical
criterion $\lambda \|f\|^2_{H_\gamma} + \frac{1}{n} \sum_{i=1}^n \max\{0, 1 - y_i f(x_i)\}$ for the observed sample $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
within the RKHS $H_\gamma(\mathcal{X})$ with the kernel $k_\gamma$ defined as $k_\gamma(x, y) = \exp\{-\frac{1}{\gamma^2}\|x - y\|_2^2\}$.
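
To make the regularized empirical criterion concrete, here is a small Python sketch that evaluates it for a function of the form $f = \sum_j \alpha_j k_\gamma(\cdot, x_j)$, for which $\|f\|^2_{H_\gamma} = \alpha' K \alpha$. The function names and the toy data are hypothetical, and the sketch only evaluates the objective; it does not solve the SVM optimization problem.

```python
import numpy as np

def gaussian_rbf_kernel(A, B, gamma):
    """Kernel matrix with entries exp(-||a - b||_2^2 / gamma^2)."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / gamma**2)

def regularized_hinge_objective(alpha, K, y, lam):
    """lam * ||f||_H^2 + (1/n) * sum_i max(0, 1 - y_i f(x_i)) for f = sum_j alpha_j k(., x_j)."""
    f = K @ alpha
    hinge = np.maximum(0.0, 1.0 - y * f)
    return float(lam * (alpha @ K @ alpha) + np.mean(hinge))

# Toy sample (assumed for illustration).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
K = gaussian_rbf_kernel(X, X, gamma=1.0)

# At f = 0 the hinge loss equals 1 at every sample point, so the objective is 1.
obj_at_zero = regularized_hinge_objective(np.zeros(50), K, y, lam=0.1)
```

The value at $f = 0$ gives the trivial upper bound on the minimized objective that is used repeatedly in the proofs.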






Lemma 23. For classification using support vector machines with a Gaussian RBF kernel, the 
RFE defined for 5^^ = 0{n^P+^) with 6^0 where /3 = ^o_f^('J^o^^ , with P^ S (0,oo) being 
the margin-noise exponent of the distribution P on M. x { — 1, 1} and r^ G ( 0, oo ] being the tail 
exponent of the marginal distribution Px , is consistent in finding the correct feature space^ . 

In order to prove Lemma 23, we need to verify Assumptions (C1) - (C6) in this setup. First
note that Assumptions (C1) and (C2) are trivially satisfied since $L_{HL}$ is Lipschitz continuous (see
Example 2.27 in SC08). (C3) is also satisfied since $\mathcal{X} \subset \mathbb{R}^d$ is separable and since an RKHS over
a separable metric space having a continuous kernel is separable (Lemma 4.33 of SC08), hence $H_\gamma$
is separable. It is also easy to see that $|k_\gamma(x, y)| \leq 1$ for all $x, y \in \mathcal{X}$ and all $\gamma > 0$, and hence
$\|k_\gamma\|_\infty \leq 1$.

From the proof of Proposition 18 (and also results in Chapter 7 of SC08) we can see that Assumption
(C4) can be replaced by the potentially weaker Assumption (C4*).

(C4*). Assume that for fixed $n \geq 1$, $\exists$ constants $a \geq 1$ and $p \in (0, 1)$ such that for any $J \subseteq
\{1, 2, \ldots, d\}$, $E_{D_X \sim P_X^n}\, e_i(\mathrm{id} : H^J \to L_2(D_X)) \leq a\, i^{-\frac{1}{2p}}$, $i \geq 1$.

It is easily seen from the steps in (20) in Appendix A.5 that instead of assuming (C4) we can assume
(C4*) and the results will hold. Then we see that Theorem 7.34 with Corollary 7.31 of SC08, along
with the fact that $d/(d + \tau)$ is an increasing function in $d$, yields a bound as given in (C4*) with
$a := \max_{d_J \leq d} c_{\epsilon,p}\, \gamma^{-\frac{(1-p)(1+\epsilon) d_J}{2p}}$ for all $\epsilon > 0$, $d/(d + \tau) < p < 1$ and a constant $c_{\epsilon,p}$ depending only
on $p$ and a given $\epsilon$. We however preferred to use (C4) instead of (C4*) in our theoretical derivations
because it can be potentially weaker in many situations.

Footnote (Lemma 23): For a discussion on margin-noise exponents and tail exponents of a distribution, refer to Chapter 8 of SC08.

Assumption (C6) follows from results obtained in Theorem 8.18 of SC08 (see also Theorem 2.7
in Steinwart and Scovel, 2007). Note that Assumption (C6) is not required for consistency results,
as we already have that $A_2(\lambda) \to 0$ when $\lambda \to 0$ from Lemma 5.15 of SC08. (C6) helps us to obtain
explicit rates for the RFE, and we show that it holds here, which will help us to derive rates in this
framework. Without going into explicit details, we can see from Theorem 8.18 of SC08 that the
approximation error for an SVM using a Gaussian RBF kernel of width $\gamma$ on $\mathbb{R}^d$ can be bounded by
a function given as

where $P$ is a distribution on $\mathbb{R}^d \times \{-1, 1\}$ that has margin-noise exponent $\beta_d \in (0, \infty)$ and whose
marginal distribution $P_X$ has tail exponent $\tau_d \in (0, \infty]$, $c_{d,\tau_d}, \tilde{c}_{d,\beta_d} > 0$ are constants and $c_d$
is the constant occurring in equation (8.10) in SC08. So for a given pair $(\lambda, d)$ if we choose

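$\gamma(\lambda, d) = \lambda^{\frac{\tau_d}{\beta_d d + d \tau_d + \beta_d \tau_d}}$ then it can be seen that $A_2(\lambda, d, \gamma(\lambda, d)) \preceq \lambda^{\frac{\beta_d \tau_d}{\beta_d d + d \tau_d + \beta_d \tau_d}}$ (where $\preceq$ denotes
'less than or equal to' up to constants). Hence the bound in Assumption (C6) is satisfied for any
$J$.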

So for a sequence of SVM objective functions A„||/|||^ + ^ X^ILi i^axjO, 1 — yif{xi)} defined 

for a sequence A~^ = o{n) with A„ — )• the assumptions for the theoretical results on consistency 

of RFE are met, and thus Lemma 23 is proved. 

8. Simulation Study. In this section we present a short simulation study to illustrate the
use of risk-RFE for feature elimination in SVMs. Note that the use of RFE for such purposes has
been in practice for well over a decade and it is a well-accepted technique in classification. The
main aim of the simulation study here is to evaluate our consistency results.

We consider two different data-generating mechanisms, one in the classical classification setting 
and the other in regression. For each of these examples we again look at three different scenarios. 
For the first scenario, the total number of covariates is 15 of which only 4 are important. For the 
second scenario, there are 30 covariates with only 7 important ones. The third scenario has 50 
covariates with 3 that are important. 

For the classification example we consider the hinge loss $L_{HL}$ as the surrogate loss and the
SVM function is computed using the Gaussian RBF kernel $k_\gamma(x_1, x_2) = \exp\{-\frac{1}{\gamma^2}\|x_1 - x_2\|_2^2\}$.
The covariates $X$ were generated uniformly on the segment $[-1, 1]$ and the output vector $Y$ was
generated as $Y = \mathrm{sign}(\omega' X)$, where $\omega$ is the vector of coefficients with the first few elements
non-zero, corresponding to the important features, chosen at random from a list of coefficients
                                 d = 15, d0 = 4          d = 30, d0 = 7          d = 50, d0 = 3
                               n=100  n=200  n=400     n=100  n=200  n=400     n=100  n=200  n=400

SVM with Gaussian Kernel
Prop of times no errors (a)    0.97   1      1         0.62   1      1         0.95   1      1
Prop of times 1 error (b)      0.03   0      0         0.34   0      0         0.05   0      0
Prop of times > 1 error (c)    0      0      0         0.04   0      0         0      0      0

SVR with Linear Kernel
Prop of times no errors (a)    1      1      1         1      1      1         1      1      1
Prop of times 1 error (b)      0      0      0         0      0      0         0      0      0
Prop of times > 1 error (c)    0      0      0         0      0      0         0      0      0

Table 1. Accuracy of Risk-RFE

$[-1, -0.5, 0.5, 1]$ and the rest are zero. We initialize the original SVM function using a 5-fold cross
validation on the kernel width $\gamma$ and the regularization parameter $\lambda$ and they were chosen from
the set of values



(16)    $(n\lambda, \gamma) = (0.01 \times 10^{i},\ j)$,  $i = \{0, 1, 2, 3, 4\}$,  $j = \{1, 2, 3, 4\}$,

where $n$ is the sample size for the given setting.
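
The data-generating mechanism for the classification example can be sketched in Python as follows. This is only an illustration, not the MATLAB code of Supplement A: the function name `simulate_classification` is hypothetical, and the convention of mapping a zero inner product to the label $+1$ is an assumption made here, since the text does not specify it.

```python
import numpy as np

def simulate_classification(n, d, d0, rng):
    """Draw covariates uniformly on [-1, 1]^d and labels Y = sign(omega' X).

    Only the first d0 coefficients are non-zero; they are drawn at random
    from the list [-1, -0.5, 0.5, 1].
    """
    omega = np.zeros(d)
    omega[:d0] = rng.choice([-1.0, -0.5, 0.5, 1.0], size=d0)
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    y = np.sign(X @ omega)
    y[y == 0] = 1.0  # assumed tie-breaking convention (a probability-zero event)
    return X, y, omega

# First scenario of the study: 15 covariates, 4 of them important, n = 100.
rng = np.random.default_rng(2)
X, y, omega = simulate_classification(n=100, d=15, d0=4, rng=rng)
```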

In the second case we used an SVR function with a linear kernel $k(x_1, x_2) = \langle x_1, x_2 \rangle$ to treat
the regression setting. The loss function we considered is the $\epsilon$-insensitive loss $L_\epsilon(x, y, f(x)) =
\max\{0, |y - f(x)| - \epsilon\}$ with $\epsilon = 0.1$. Covariates are generated as before while $Y$ is now generated
as y = u^'X + 3-^dim(x)(0) !)• As before we initialize with a 5-fold Cross Validation on A. 
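
For concreteness, the $\epsilon$-insensitive loss used in the regression setting can be written as a short Python function. The function name `eps_insensitive_loss` and the example values are hypothetical and serve only as an illustration.

```python
import numpy as np

def eps_insensitive_loss(y, fx, eps=0.1):
    """max(0, |y - f(x)| - eps): residuals smaller than eps incur no loss."""
    y = np.asarray(y, dtype=float)
    fx = np.asarray(fx, dtype=float)
    return np.maximum(0.0, np.abs(y - fx) - eps)

# Residuals of 0.05, 0.5 and 1.0 with eps = 0.1.
losses = eps_insensitive_loss([1.0, 1.0, 1.0], [1.05, 1.5, 0.0])
```

The first residual lies inside the insensitivity band and is not penalized, while the other two are penalized by their excess over $\epsilon$.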

Fig 1. Reverse Scree Graph for one run of the simulations for (a) SVM with Gaussian Kernel and (b) SVR with Linear
Kernel, with $d = 30$, $d_0 = 7$.

Fig 2. Linear-Quadratic mixture change point analysis for (a) SVM with Gaussian Kernel for comparable cross
validation values of $\lambda$ and kernel width $\gamma$ and (b) SVR with Linear Kernel for comparable cross validation values of
$\lambda$, with $d = 30$, $d_0 = 7$ for varying sample sizes. The bold dots represent the estimated change points.

Fig 3. Linear-Quadratic mixture change point analysis for (a) SVM with Gaussian Kernel for comparable cross
validation values of $\lambda$ and kernel width $\gamma$ and (b) SVR with Linear Kernel for comparable cross validation values of
$\lambda$, with $d = 50$, $d_0 = 3$ for varying sample sizes. The bold dots represent the estimated change points.

We repeat the process for different sample sizes $n = \{100, 200, 400\}$. We also repeat the simulations
100 times each to note down the proportion of times the Risk-RFE made no errors (a), made
only one error (b) or made more than 1 error (c) (see Table 1), where a mistake is made if the
rank of any non-important feature is found to be higher than that of any important one. The entire
methodology was implemented in the MATLAB environment. For the implementation we used the
SPIDER library for MATLAB, which already has a feature elimination algorithm based on RFE,
and we modified it accordingly to suit our criterion for reduction. The codes for the algorithm and
the simulations are given in Supplement A.
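
The error criterion described above can be sketched in Python as follows. The precise way errors are counted in the study is not spelled out, so this sketch assumes one natural reading: an error is a non-important feature that is eliminated after at least one important feature, and therefore ends up ranked above it. The function name `count_rank_errors` is hypothetical.

```python
def count_rank_errors(elimination_order, important):
    """Count non-important features that outrank at least one important feature.

    `elimination_order` lists features in the order RFE removed them
    (first entry = removed first = lowest rank).
    """
    important = set(important)
    # Position of the earliest-eliminated important feature.
    first_imp = min(i for i, f in enumerate(elimination_order) if f in important)
    # Any non-important feature removed after that point outranks it.
    return sum(1 for f in elimination_order[first_imp + 1:] if f not in important)

# Feature 2 survives longer than important feature 0: one error.
errors_a = count_rank_errors([5, 4, 3, 0, 2, 1], important={0, 1})
# All non-important features are removed first: no errors.
errors_b = count_rank_errors([5, 4, 3, 2, 1, 0], important={0, 1})
```

Averaging indicators such as `errors == 0`, `errors == 1` and `errors > 1` over repeated runs would then give proportions of the kind reported in rows (a), (b) and (c).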



^The Spider library for Matlab can be downloaded from http://www.kyb.tuebingen.mpg.de/bs/people/spider/ 
One important question we inevitably face in feature elimination is when to stop. Note that
our theoretical results suggest the existence of a gap $\epsilon_0$ and our results show that asymptotically
the difference in the empirical versions of the objective functions exceeds it whenever we move
beyond the correct dimension. Practically it is almost impossible to characterize this gap for a
given setting, but the existence of this gap can be observed from the values of the objective
function at each stage of the algorithm. One idea that can be implemented is that of a 'reverse
Scree graph' (see the section on Scree graphs in Chapter 6 of Jolliffe (2002)). The Scree graph
is a well-established tool for choosing the correct number of Principal Components in
PCA and that same idea can be applied here as well. We plot the values of the objective function
$\inf_{i \in Z \setminus J} \lambda \|f_{D,\lambda,H^{J \cup \{i\}}}\|^2_{H^{J \cup \{i\}}} + \mathcal{R}_{L,D}(f_{D,\lambda,H^{J \cup \{i\}}})$ at each run of the algorithm in a graph. Figure 1
justifies such an argument.



For a further exploratory analysis of this gap and to characterize the number of features to
be eliminated, we tried some ad hoc model diagnostic tools. From a heuristic standpoint, the
phenomenon captured in Figure 1 suggests that if we fit a regression model to the observed
objective function values in the scree plot, we should expect a change in the slope of the regression
line right after we start eliminating significant covariates, because of the aforementioned gap. One
plausible way to analyze this gap is to fit a change point regression model of the observed values on
the number of cycles of RFE and to use the estimated change point as an ad hoc stopping
rule, so as to eliminate all features ranked below that point. Since asymptotically the
change in the objective function is expected to be negligible to the left of the change point, we fit
a linear trend there. To the right of the change point, however, these changes might show non-linear
trends, and hence we tried linear and quadratic trends to model them. The quadratic trend seemed
to work better. Figures 2 and 3 show this analysis with the mixture of linear-quadratic fits.
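
A change point fit of this kind can be sketched in Python as below. This is a simplified stand-in, not the procedure used to produce the figures: it grid-searches the change point, fits an ordinary least-squares line to the left and a quadratic to the right, and picks the split with the smallest total squared error. The function name `fit_change_point`, the minimum segment sizes, and the synthetic objective values are all assumptions made for the illustration.

```python
import numpy as np

def fit_change_point(values):
    """Return the cycle index c minimizing the SSE of a linear fit on
    cycles 0..c plus a quadratic fit on cycles c..end."""
    v = np.asarray(values, dtype=float)
    t = np.arange(len(v), dtype=float)
    best_sse, best_c = np.inf, None
    # Keep at least 3 points on the left and 4 on the right so that
    # neither fit interpolates its segment exactly.
    for c in range(2, len(v) - 3):
        left = np.polyfit(t[:c + 1], v[:c + 1], 1)
        right = np.polyfit(t[c:], v[c:], 2)
        sse = (np.sum((np.polyval(left, t[:c + 1]) - v[:c + 1]) ** 2)
               + np.sum((np.polyval(right, t[c:]) - v[c:]) ** 2))
        if sse < best_sse:
            best_sse, best_c = sse, c
    return best_c

# Synthetic objective values: flat for the first 9 cycles, then growing
# quadratically once important features start being removed.
values = [0.2] * 9 + [0.2 + 0.05 * (k - 8) ** 2 for k in range(9, 13)]
change_point = fit_change_point(values)
```

On the synthetic values above the estimated change point is cycle 8, the last cycle before the objective starts to rise; features eliminated up to that cycle would be discarded and the rest retained.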

So heuristically it is possible to justify the choice of the correct dimensions based on a reverse 
Scree graph. Otherwise it is a user-determined choice for the gap size to determine how many 
dimensions are required in a specific setting. 

9. Discussion. We proposed an algorithm for feature elimination in empirical risk minimization
and support vector machines. We studied the theoretical properties of the method, discussed
the necessary assumptions, and showed that it is universally consistent in finding the correct feature
space under these assumptions. We provided case studies of a few of the many different scenarios
where this method can be used. Finally, we gave a short simulation study to illustrate the method
and discussed a practical method for choosing the correct subset of features.

Note that Lemma 21(iii) and Lemma 22(iii) establish the existence of a gap in the rate of 
change of the objective function at the point where our feature elimination method begins removing 
essential features of the learning problem. This motivated us to use a scree plot of the values of the 
objective function at each cycle, and indeed our simulation results support our approach by visually 
exhibiting this gap. Moreover, the graphical interpretation of the scree plot motivated the use of 
change point regression to select the correct feature space. It would be interesting to conduct a more 
detailed and formal analysis of this gap in real life settings to facilitate more efficient, automated 
practical solutions. 

To the best of our knowledge, not much analysis has been done on the properties of variable
selection algorithms under such general assumptions on the probability generating mechanisms of
the input space, especially in support vector machines. So the results generated in this paper can
act as a good starting point for similar analyses in other settings. One useful extension would
be to study the properties of RFE under the high dimensional framework of a 'large p small
n problem'. It would also be interesting to analyze RFE for other settings, including censored
support vector regression (see Goldberg and Kosorok (2013)) or other machine learning problems,
including reinforcement learning or other penalized risk minimization problems.

APPENDIX A: PROOFS 
A.1. Proof of Lemma 5.

Proof. The direction £j^(A''^) C Coo{X^) is obvious since co-ordinate projection maps are 
continuous. To show that £^(A'-^) D Coo{X^) let us take g G Coo{X-^). Then g : A'"' i-^ M is 
measurable with ||g||oo < oo- Extend 5 to ^ to include the whole domain X by defining g{x) = 
g {tt'^" {x)^. Since Ij is measurable with H^Hoo = llslloo; we have that Ij G Coo{X) and Ijo -k-^" = g, so 

A.2. Proof of Lemma 6.
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Proof. (1) For any function / G Cooi-^), by the denseness of J-' we can find a sequence of 
functions {g-n} S J' such that Qn ^^ f uniformly. Now fix an arbitrary function / G C'l^{X) C 
C'ooi'^) and consider any sequence of functions {gn} € ^ that converges to / uniformly. Construct 
the new sequence of functions {5^} where for any function f ^ F, f^ is defined by /"^(x) = 
f{TT-^\x)). Observe trivially that {g^} G -T^"'- 
Now {gn} I— ^ / uniformly =^ for any e > 0, 3 A^ such that W n > N, 

sup \gn{x) — f{x)\ < e \/n > N =^ sup \gnix) — fix)\ < e \/n> N 

=> snp\gn{7T-^\x))-f{7T^\x))\<e Vn>iV 
xex 

=> snp\g^ix)-f{x)\<e Vn > iV (•.• f{n'\x)) = fix)) 
xex 

=^ {9n} ^ f uniformly. 

Hence T'^ is dense in C'^{X). 

(2) Since T is compact, for any e > 0, 3 {/n}„=i G -^ such that .F C M ]B||.||^(/„, e) (where 

n=l 

Bj|.||^(/„, e) is a II • ||oo ball of radius e with center /„). We now fix / € -F'^ and note that 3 an 
equivalent class of functions {g-^} in T such that for any two functions gf and g2 G {g-^} we have 
that gi ~ (72 ™ the sense that (/( o vr = gi^oTT = /. Fix one such ^•' G {fi'-' }• Since g-' £ J-, 3 fi 
e {/n}^=i such that d{fi,g-f) < e, that is, 

sup \fi{x) - g^{x)\ < e => sup \fi{x) - gHx)\ < e 
^ sup|/,(vr-^^(x))-5^(7r-^^(x))|<e ^ sup |//(x) - /(x)| < e (; : gf {^J\x)) = f {x)) 

x^X x&X 

^ {/n }n=i forms a finite e-cover for the set F . 

Hence F is compact. 

(3) To see (3), note that if /i, . . . , /2"-i is an e-net of J-", then for any / G J-", we have i G 
{!,..., 2"-!} such that ||/ - /i||oo < e. Then, 

11/ o vr-^' - /i o vr-^loo = sup |/ o 7r'^'(3;) - /i o ■k-^\x)\ = sup |/(x) - fi{x)\ 

x&X xi^X-J 

< sup|/(x) -/i(x)| = ||/-/j||oo < e- 
xex 

Hence /i o ir-^" , . . . , f2«-i o tt-^" is an e-net of F'^ . D 
A.3. Proof of Lemma 14.

Proof. Note that if we define $g_f := L \circ f - E_P(L \circ f)$, then $\mathcal{G} = \{g_f : f \in \mathcal{F}\}$ is a separable
Carathéodory set (for a discussion on Carathéodory families of maps, refer to Definition 7.4 in
SC08). To see this, first note that $\|g_f\|_\infty \leq \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} |L \circ f - E_P(L \circ f)| \leq 2B$ for $B$ defined in
the statement of the Lemma. Also by assumption, $\|\cdot\|_{\mathcal{F}}$ dominates the pointwise convergence of
functions (so $f_n \to f$ in $\|\cdot\|_{\mathcal{F}}$ $\Rightarrow$ $f_n \to f$ pointwise). Then the fact that $L$ is locally Lipschitz
continuous coupled with Lebesgue's Dominated Convergence Theorem (since $\|L \circ f\|_\infty \leq B$) gives
us the above assertion.

Now note that $E_P(g_f) = 0$ and $E_P g_f^2 \leq (2B)^2 = 4B^2$ for $B$ as before, so we can apply
Talagrand's inequality given in Theorem 7.5 of SC08 on $G$ defined as $G : \mathcal{Z}^n = (\mathcal{X} \times \mathcal{Y})^n \to \mathbb{R}$
such that

(17)    $G(z_1, \ldots, z_n) := \sup_{g_f \in \mathcal{G}} \left| \frac{1}{n} \sum_{i=1}^n g_f(z_i) \right| = \sup_{f \in \mathcal{F}} |\mathcal{R}_{L,D}(f) - \mathcal{R}_{L,P}(f)|$,

and hence, for $\gamma = 1$ and for all $\tau > 0$, we have

(18)    $P^n\left(\left\{ z \in \mathcal{Z}^n : G(z) \geq 2 E_{P^n}(G) + 2B\sqrt{\frac{2\tau}{n}} + \frac{10 B \tau}{3n} \right\}\right) \leq e^{-\tau}$.
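
The constants in (18) can be traced back to Talagrand's inequality as follows. The general form written below is recalled from Theorem 7.5 of SC08 and should be checked against that reference; the symbols $\sigma^2$ (bound on the second moment) and $b$ (bound on the supremum norm) are notation introduced here for the illustration.

```latex
\begin{align*}
P^n\Big(\Big\{ z : G(z) \ge (1+\gamma)\,E_{P^n}(G) + \sqrt{\tfrac{2\tau\sigma^2}{n}}
   + \Big(\tfrac{2}{3} + \tfrac{1}{\gamma}\Big)\tfrac{\tau b}{n} \Big\}\Big) &\le e^{-\tau}. \\
\intertext{With $\sigma^2 = 4B^2$, $b = 2B$ and $\gamma = 1$, the two deviation terms become}
\sqrt{\tfrac{2\tau \cdot 4B^2}{n}} &= 2B\sqrt{\tfrac{2\tau}{n}}, \\
\Big(\tfrac{2}{3} + 1\Big)\tfrac{2B\tau}{n} &= \tfrac{10B\tau}{3n},
\end{align*}
```

and the leading factor $(1 + \gamma)$ becomes $2$, which matches the three terms appearing in (18).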

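So now we need to bound the term $E_{P^n}(G) := E_{P^n}\left\{ \sup_{f \in \mathcal{F}} |\mathcal{R}_{L,D}(f) - \mathcal{R}_{L,P}(f)| \right\}$.

Defining the new Carathéodory set $\mathcal{H}$ as $\mathcal{H} = \{h_f := L \circ f : f \in \mathcal{F}\}$, for a probability distribution
$P$ on $\mathcal{Z} = (\mathcal{X} \times \mathcal{Y})$, we can use the idea of symmetrization given in Proposition 7.10 in SC08 to
bound $E_{P^n}\left\{ \sup_{f \in \mathcal{F}} |\mathcal{R}_{L,D}(f) - \mathcal{R}_{L,P}(f)| \right\}$. We have for all $n \geq 1$,

$E_{D \sim P^n} \left\{ \sup_{f \in \mathcal{F}} |\mathcal{R}_{L,D}(f) - \mathcal{R}_{L,P}(f)| \right\} = E_{D \sim P^n} \sup_{h_f \in \mathcal{H}} |E_P h_f - E_D h_f| \leq 2 E_{D \sim P^n} \mathrm{Rad}_D(\mathcal{H}, n)$,

where $\mathrm{Rad}_D(\mathcal{H}, n)$ is the $n$-th empirical Rademacher average of $\mathcal{H}$ for $D := \{z_1, \ldots, z_n\} \in \mathcal{Z}^n$
with respect to the Rademacher sequence $\{\varepsilon_1, \ldots, \varepsilon_n\}$ and the distribution $\nu$, which is given by

$\mathrm{Rad}_D(\mathcal{H}, n) = E_\nu \sup_{h \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i h(z_i) \right|$.

So we see now that it suffices to bound $E_{D \sim P^n} \mathrm{Rad}_D(\mathcal{H}, n)$.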
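For that we use Theorem 7.16 of SC08, but before that note from (C4) that we have, for
fixed $n \geq 1$, that $\exists$ constants $a \geq 1$ and $p \in (0, 1)$ such that

(19)    $E_{D_X \sim P_X^n}\, e_i(\mathcal{F}, L_2(D_X)) \leq a\, i^{-\frac{1}{2p}}$,  $i \geq 1$.

First observe that $\mathcal{H} \subset L_2(P)$. Now since Lipschitz continuity of $L$ gives us that $|L(x, y, f_1(x)) -
L(x, y, f_2(x))|^2 \leq c_L(C)^2 |f_1(x) - f_2(x)|^2$, it is easy to see that
$e_i(\mathcal{H}, \|\cdot\|_{L_2(P)}) \leq c_L(C)\, e_i(\mathcal{F}, \|\cdot\|_{L_2(P_X)})$. Hence we have

(20)    $E_{D \sim P^n}\, e_i(\mathcal{H}, \|\cdot\|_{L_2(D)}) \leq c_L(C)\, E_{D \sim P^n}\, e_i(\mathcal{F}, \|\cdot\|_{L_2(D_X)})$
        $\leq c_L(C)\, E_{D_X \sim P_X^n}\, e_i(\mathcal{F}, \|\cdot\|_{L_2(D_X)})$
        $\leq c_L(C)\, a\, i^{-\frac{1}{2p}}$.

Now noting that $\|h_f\|_\infty \leq B$ and $E_P h_f^2 \leq B^2$ for $B$ defined as before, the conditions of Theorem
7.16 of SC08 are satisfied with $\tilde{a} = c_L(C)\, a$ and hence we have

(21)    $E_{D \sim P^n} \mathrm{Rad}_D(\mathcal{H}, n) \leq \max\left\{ C_1(p)\, \tilde{a}^p B^{1-p} n^{-\frac{1}{2}},\ C_2(p)\, \tilde{a}^{\frac{2p}{1+p}} B^{\frac{1-p}{1+p}} n^{-\frac{1}{1+p}} \right\}$

for constants $C_1(p)$, $C_2(p)$ depending only on $p$. Hence we finally have that, with probability
$\geq 1 - e^{-\tau}$,

$\sup_{f \in \mathcal{F}} |\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)| \leq 2B\sqrt{\frac{2\tau}{n}} + \frac{10 B \tau}{3n}$
        $+\ 4 \max\left\{ C_1(p)\, c_L(C)^p a^p B^{1-p} n^{-\frac{1}{2}},\ C_2(p)\, c_L(C)^{\frac{2p}{1+p}} a^{\frac{2p}{1+p}} B^{\frac{1-p}{1+p}} n^{-\frac{1}{1+p}} \right\}$.

□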

A.4. Proof of Proposition 15.



Proof. For o^p > n, note that |7^L,D (/d,j--^2) - 'R-l.d {fD,Th)\ < '^l,d (/d,j--'2)+'^l,d (/d,j--^i) 
< 2B < 2AKi < 2AKi h^\ for Ki > B/4. Hence we assume a^P < n. Now note that for 
Jii J2 ^ J such that Ji C J2 C J^, we have, 

+ \'^L,P,T-'i - '^L,P {fD,Fh) I + \T^L,P {fD,FJi) " '^L,D {fD,Fh) 

EE Ai + Bi + Bl + Al 

since from Assumption (Al) we have that TIl,p {fpjrJi) = TIl,p {fpjrJ2) for Ji ^ J2 C J^. 

For A\ and A^^ we have, \TZl,d {foTJi) - T^l,p (/d J-J2) I < sup |7^l,p(/) - TZL,D{f)\ and 

/eJ--^2 

|^L,D (/d^^i) -^L,P UdT^i) I < sup |7eL,p(/) -7^i,^(/)|. 
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Now for Bi and B^ we have from (6.13) of SC08 that, 

\'^L,p{fD,TJ2)-'^L,p{fp,TJ2)\=nL,p{fDT^2)-KpTJ2 <2 sup |7^L,p(/)-^L,D(/)|, 
\T^L,P {fD,TJi) - T^L,P (/p,^a) I = nL,P {fD,T-h) - K p^j, < 2 sup ITZlAI) - T^L,D{f)V 



,,C:=M and 



Now see from Lemma 6 that the conditions of Lemma 14 are satisfied for || • || j- 
B := B for each of the functional classes J- . Also since a^^ < n, we have ( ^ 
for p G (0, 1). Hence we have our assertion. 

A.5. Proof of Proposition 18.

Proof. First note that since S > 1 and K > B^/A, we have 24:KB^'P > 6B > 2. Now if 
a^^ > X^n, the inequality trivially follows from the fact that 



< 



^\fDXH-'A\H-'2 +'^L,D {fD,\,H'2) + ^\f DXH-'A\h-'i +^L,D (/d,A,H-^i ) 



< 27^L,B(0) < 24KB^-P 



,2p\ 2 



XPfi 



since '7^l,d(0) < 1- Hence we assume from here on that a^^ < X^n. Now observe that since H is 
separable, from Lemma 8 we have that the H'^s are also separable. Hence from Lemma 6.23 of 
SC08 we have that the SVMs produced by these RKHSs are measurable. 

Now note that L{x, y, 0) < 1 =^ for any distribution Q on ^xj^, we have that TZl^q{0) < 1. Since, 
'^^^jMIIWhJ +'^L,Qif) < TZl,q{0), we have that ||/q,a,hHIhj - V ^^'a • ^°^ ^™^^ ^^ Lemma 
4.23ofSC08 ll/IU < ||A:||oo 11/11,^./ for all / G i?^, we have that W/q^xmAL ^ I^QXhAIh-J ^ ^~^'^- 
So, consequently, for every distribution Q on <Y x 3^, we have 



(22) 



Now, 



\^L,P UqXH') - T^L,D ifQXHj) I < sup |7^L,p(/) - 7^L,D(/)| • 



HhJ 



<A-i/2 



^\\fDXHJ2\\jjj2 +'^L,D {fDXHJ2) - ^WfoXH'hWjjj, -'T^L^D [foXH-h, 



< 

+ 






A NOTION OF CONSISTENCY FOR RFE IN SVMS AND ERMS 






since from (Al), TV^pjjj^ = T^*l^p^h-'2 = '^*l,p,h- Noting that 
^\\fD.\,Hj\\iJj~^'^L,p{fD,\,H-j) ~'^*LPHJ ^ 0; we have from (6.18) of SC08 that 

^ ||/d,A,H-^||/^j +'^L,P {fD,X,Hj) - '^L,P,H-' 
< A^ (A) + 7^L,p {fD,X,Hj) - T^L.D {fD,\,Hj) + '^L,D {fp,\,Hj) " '^L,P {fpXHj) 

(23) <Ai{\) + 2 sup |7^L,p(/)-7^L,D(/)|. 

II/I1hJ<a-i/2 

From (22) and (23) and the fact that Ji, J^^ J such that Ji ^ J2 C J^, we have that 

< Ai'{X)+Ai^X) + 3 sup |7^L,p(/)-7^L,D(/)|+3 sup |7^L,p(/) - 7^L,z)(/)| • 

First note that for / G X-^/'^B^j and B := cl(A"^/2)A"^/2 + 1, we have \L{x,y, f{x))\ < 
\L{x,y, f{x)) — L{x,y,0)\ + L{x,y,0) < B for all {x,y) £ X x y. Also note that Assumption 
(C4) implies that Ed;,^pu {ei.{X-^/^BH, \\ ■ \\l^(D;,))) < X-^/^ai'^. 

Now note from Lemma 8 that the conditions of Lemma 14 are satisfied for T := X^^''^B^j 
II . \\jr := II . Il^j, C := A-1/2 and B := cl(A-i/2);^-i/2 _^ ^ ^^ g^pj^ ^f ^.j^g RKHS classes if-^. Also 
since a^P < X^n and B > 1, we have ' " *" * --, I a p \ 



> 



i-p 



Hence we have our assertion. 



and B^-P > B^+~p for p E (0,1). 

D 



A.6. Proof of Lemma 22.

Proof, (i) Fixing a A G [0,1], we have that B := cl(A"^/2)_x-i/2 + 1 < 2A"^/2. Now since 
|X| < X ^ X < X for any x > 0, we see from Proposition 18 that 

< 4HA) + A'^X) + 24A-V%/^ + 40A-V2I + 48i^2A-'^ ( ^^ ' 

V n n V X^n . 



(24) 



A:^^{X) + ^2'(A) + 24\/2^(An)-5 + 40r(Ain)-^ + 48K2a^P{Xn)~^ 



with probability at least 1 — 2e '^. Also from Corollary 19, for J £ J similarly, we have 



(25) 



< A^iX) + 12\/27(An)"5 + 20r(A5n)"^ + 24/^2 a^P( An) "^ 
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with probability at least 1 — e~'^ . Then under Assumption (C5) and Lemma 5.15 along with (5.32) 
of SC08 we obtain that the right hand side of the above inequality converges to 0. So the denseness 
assumption of the RKHSs additionally gives us the universal consistency of our feature elimination 
algorithm. To establish the convergence rate of our algorithm we further assume (C6) that there 

exists c > and /? E (0, 1] such that A2 < cX" for any J and for all A > 0. Then it can be seen 

1 
that asymptotically the best choice for A„ in (24) or (25) is a sequence that behaves like n (^/s+i) 

and then the inequalities in (24) and (25) are satisfied with the l.h.s. replaced by e^ and e„/2 

. §_ 4,3+1 

respectively, where e„ is given by (2c + 24v 2t + 48-K'2a )^ '^^^^ + 40rn 2(2/3+1) _ This proves (i) 

for {e„} for a suitable choice of r. 

(ii) Observe from Corollary 19 along with Assumptions (C5), (C6) and steps in proof of (i) given 

above that, 



(26) 



^n ||/d,A„,hHIhJ +T^L,D {fD,X„,Hj) -'^L,P,H'' ^ ^"/^ 



occurs with P" probability greater than 1 — e^ for any J C {1, 2, . . . , d} where e^ is given as before. 
Also note that from Assumption (A2) we have that T^*TpTr.j2 — eo ^ '^*lphJ* ~ '^*LPH'h- ^° 
for H'^'^ we have, 

<6n] > 1 - e"^ 



(27) 

and for if ^ we have 

P^ [dg{Xx yr : \Xn \\fD,X^,H-h WIj, + nL,D (/d,A„,H-^i) " K,P,Hh | < ^n) > 1 " e"" 
^ P" (l) e (A- X 3^)" : A„ \\fD,X„,H-'i \\l.H + '^L,D ifD,X„,HH) < K,P,H-H + '^n) > 1 " e'^ 

(28) 
^ P" (l) e (Af X 3^)" : Xn \\fDM,H^i llL/1 + ^^.^ ifD,x„,Hh) +60" 5„ < Kphj^) > 1 - e-\ 

Then (27) and (28) from above jointly imply that 

(29) Xn ||/d,A„,H-^2 \\hJ2+ ^L,D {fD,X„,H-'2) " ^n ||/d,A„,H-'i Wh-Ji ~ ^L,D {fD,X„,Hh) > eo - 2(5„ 



with P" probability greater than 1 — 2e ^. 




Also it is easy to see that since £„ — 5- with n — )• oo, the gap ?n = eo — Cn — > eo > 0. 

(iii) From Assumption (Al), Corollary 20, additional Assumptions (C5), (C6) and steps in the 
proof of (i) given above, the 'if condition of (iii) follows since for any J and for all e > 0, r > 
and n > 1 we have. 



< ??„ > 1 - e ^, 



(30) P" (d G (^ X yr ■■ \Xn \\fD,X„,HA\lj + ^L,P {fD,X„,Hf) " nlp,H 

, „ /3 4/3+1 

where ^n = (c + 8V2t + lQK2a^^)n 2/3+1 + 40/3rn 2(2/3+i) . 

Now for J\ ^ J and J2 ^ J ^e have TZ* p-p.j2 — ^o ^ ^^ pj="J* ~ ^1 pj^'-^i ^'^'^ hence the 'only 
if condition of (iii) also follows since 



P" (D G (A- X yr : A„ ||/b,,„,^./. \\%,^ + nL,P {fD,x^,H'.) - K,p,H^i >eo-6n)>l- e-- 

(31) 

^ P" (d G (A- X yr ■■ \n \\fD,X^,H^2 £., + ^L,P (/d,A„,//^2) " n,P,H > e, -~5n) > I - e'^ 



Now since ?]« — )■ with n — )• 00, the gap ?n = ^o — ^n — ^ eo > 0. D 

A.7. Proof of Lemma 21.
Proof, (i) Note that Proposition 15 implies that we have with probability at least 1 — 2e~^, 

(32) ^ZL,D{fn,T.^)<nL,D{fn,TH)+^'^B^- + —- + 24K^^—j . 

This trivially establishes (i) for en = 12B\/2Tn''^^'^ + 20BTn^^ + 24i^iaPn~^/^ for sufficiently large 
r. 

(ii) Corollary 16 gives us that \TZl,d {fD,^-^) ~ '^*l pj^jI < ^nf^ with probability at least 1 — e~'^ 
for any J C {1, 2, ... , d}, where e„ is as defined in (i). 

Now, following similarly as the steps given in the proof of Lemma 22 in Appendix A. 6, we obtain 
that P" {De{Xx yr : 7^L,D (/d.^f-'z) - T^l,d {fD,Th) > eo - £«) > 1 - 2e~^ ■ It is easy to see 
now that since e„ — ?■ with n — )■ 00, the gap ?« = eo — e„ — > eo > 0. So this establishes (ii). 

(iii) Corollary 17 along with Assumption (Al) gives us that for any J, 

(33) P'^ (I) G (A- X 3^)" : \nL,P {fD,Tj) - n^p^^l < r?„) > 1 - e-^ 

where ijn = 4P\/2Tn~"'^' ^ + 20/3PTn^"'^ + 8X1 a^n"^''^ and hence the 'if condition of (iii) follows. 
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Now for J\ ^ J and J2 ^ ^7 we have 7^^ ^ j^ — eo > TVj^p-pj, = '^*lptJi: ^-iid the 'only if 
condition of (iii) follows since 

(34) P" (d G (A- X yr : nL,P {fD,Th) > '^Xp,t^. + eo - r?n) > 1 - e-^ 

since r^n — >■ with n — ;■ 00, so the gap ?„ = co — Vn — > eo > 0. D 

SUPPLEMENTARY MATERIAL 

Supplement A: Matlab Code 

(http://www.bios.unc.edu/ kosorok/RFE.html). Details on the codes are given in the html page. 